evalloop 0.1.1__tar.gz

@@ -0,0 +1,29 @@
name: Publish to PyPI

on:
  push:
    tags:
      - "v*"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[dev]" && python -m pytest tests/ -q

  publish:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # required for PyPI Trusted Publishing (no token needed)
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install build && python -m build
      - uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,9 @@
__pycache__/
*.pyc
*.pyo
.pytest_cache/
*.egg-info/
dist/
build/
.evalloop/
FEEDBACK.md
@@ -0,0 +1,319 @@
Metadata-Version: 2.4
Name: evalloop
Version: 0.1.1
Summary: Closed-loop eval monitoring for LLM-powered products
License: MIT
Requires-Python: >=3.9
Requires-Dist: click>=8.0.0
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Provides-Extra: voyage
Requires-Dist: voyageai>=0.2.4; extra == 'voyage'
Description-Content-Type: text/markdown

# evalloop

**Sentry for AI behavior.** Closed-loop eval monitoring for LLM-powered products.

You changed your prompt on Friday. Your bot broke. Your users noticed before you did.

evalloop wraps your LLM client with one line. Every call is scored against your known-good baselines in the background — zero added latency. Regressions surface in your terminal before they reach your users.

---

## Who this is for

**You're building an AI product** — a chatbot, support bot, summarizer, coding assistant. You make prompt changes regularly. You have no eval system. You've shipped a regression at least once and found out from a user.

| If you're... | Your pain | evalloop gives you... |
|---|---|---|
| A solo founder shipping fast | You're the only engineer. When a prompt change breaks your bot over a weekend, users notice before you do. You have no eval infra and no time to build one. | Score trends after every call, so regressions surface in 2 minutes, not Monday morning |
| An AI engineer at a startup | You improve prompts weekly but can't prove quality went up. Your PM asks "did the last change make it better?" and you have no answer. | A shareable eval report with timestamped scores by prompt version, so you can show your PM exactly what changed and when |
| An ML engineer with internal LLM tools | Internal tools have no user feedback loop. If your tool degrades, users just quietly stop using it. You'd never know without monitoring. | Automated regression alerts, so quality drops surface before users abandon the tool |

**Before evalloop:**
```
Prompt change pushed Friday 5pm
→ Users complain Monday morning
→ 4 hours debugging which change broke what
→ Average detection time: 3 days
```

**After evalloop:**
```
Prompt change pushed Friday 5pm
→ evalloop watch alerts Friday 5:02pm
→ Developer reverts before closing laptop
→ Average detection time: 2 minutes
```

---

## Install

```bash
pip install evalloop
```

Requires Python 3.9+. Then run the setup wizard:

```bash
evalloop init
```

evalloop auto-detects your scoring backend from environment variables — no extra config needed:

| Keys you have | Scoring backend | Accuracy |
|---|---|---|
| `VOYAGE_API_KEY` | Semantic embeddings via Voyage AI | Best |
| `ANTHROPIC_API_KEY` | LLM-as-judge via Claude Haiku | Good |
| Neither | Heuristics only (length/format) | Basic |

If you're already wrapping an Anthropic or OpenAI client, evalloop uses your existing key automatically — no extra signup required.
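The fallback order in the table above can be sketched in a few lines. This is illustrative only; `detect_backend` is a hypothetical name, not part of evalloop's public API:

```python
import os

def detect_backend(env=None):
    """Pick the best available scoring backend, per the table above."""
    env = os.environ if env is None else env
    if env.get("VOYAGE_API_KEY"):
        return "voyage"       # semantic embeddings (best)
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"    # LLM-as-judge via Claude Haiku (good)
    return "heuristics"       # length/format checks only (basic)
```

Note that `VOYAGE_API_KEY` wins when both keys are present, matching the accuracy ranking in the table.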

---

## Quickstart

```python
from evalloop import wrap
import anthropic

# One line change — evalloop uses your existing ANTHROPIC_API_KEY automatically
client = wrap(anthropic.Anthropic(), task_tag="qa")

resp = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
# ^ scored against your baselines in the background. Zero added latency.
```

Then check your dashboard:

```
$ evalloop status

🟢 [qa]
  Calls captured : 142
  Scored         : 142
  Avg (7d)       : 0.81
  Avg (24h)      : 0.79
  Trend (recent) : ▆▆▇▇▆▅▆▇▇▆▇▆▇▇▆▆▇▆▇▆
```

Or watch for regressions automatically:

```
$ evalloop watch --interval 60

evalloop watch — polling every 60s. Ctrl-C to stop.

🔴 [qa]
  Calls captured : 201
  Avg (7d)       : 0.81
  Avg (24h)      : 0.71
  ⚠ Regression   : score dropped 10.0pp in last 24h

^C
Watch stopped.
```

---

## How it works

```
your call
    │
    ▼
wrap(client).messages.create()
    │  returns immediately (zero latency added)
    │
    └─▶ background thread
            │
            ├─▶ infer task tag (from system prompt, if not set)
            ├─▶ score(output, baselines)
            │     ├─▶ heuristics (empty, too short, too long)
            │     └─▶ cosine similarity via Voyage AI
            └─▶ sqlite insert → ~/.evalloop/calls.db
```

- **Zero latency** — scoring runs in a background thread
- **Silent on errors** — disk full, API down, DB locked → log to stderr, never crash your app
- **Graceful exit** — on normal Python exit, flushes the queue so no captures are lost
- **Degraded mode** — if Voyage AI is unavailable, heuristic-only scoring continues (flags `degraded_mode`)
- **Self-hosted** — all data stays local in `~/.evalloop/`

---

## Auto tag inference

If you don't set `task_tag`, evalloop infers it from your system prompt:

```python
client = wrap(anthropic.Anthropic())  # no task_tag needed

resp = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=256,
    system="You are a helpful assistant that summarizes documents.",
    # ↑ evalloop detects → task_tag="summarization"
    messages=[{"role": "user", "content": "Summarize this article..."}],
)
```

Supported task types for auto-inference: `qa`, `summarization`, `code`, `customer-service`, `classification`
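Inference like this typically reduces to keyword matching on the system prompt. A toy sketch; the keyword table is invented for illustration and is not evalloop's actual rule set:

```python
def infer_task_tag(system_prompt):
    """Guess a task tag from keywords in the system prompt (illustrative)."""
    rules = [
        ("summar", "summarization"),
        ("code", "code"),
        ("classif", "classification"),
        ("customer", "customer-service"),
        ("support", "customer-service"),
        ("question", "qa"),
    ]
    text = (system_prompt or "").lower()
    for needle, tag in rules:
        if needle in text:
            return tag
    return "qa"  # assumed fallback when nothing matches
```

Rule order matters: the first matching keyword wins, so more specific patterns should come first.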

---

## CLI commands

```bash
# First-time setup — detect keys, choose scoring backend, install baselines
evalloop init

# Score trends for all tags
evalloop status

# Filter to one tag
evalloop status --tag qa

# Watch for regressions (polls every 60s by default)
evalloop watch
evalloop watch --tag qa --interval 30

# Export calls + scores for sharing or analysis
evalloop export                           # JSON to stdout
evalloop export --format csv -o out.csv   # CSV to file
evalloop export --tag qa --limit 500      # filtered export

# Manage baselines
evalloop baseline add "your good output" --tag my-task
evalloop baseline list
evalloop baseline install                 # install all built-in defaults
evalloop baseline install --tag qa        # install for one tag
evalloop baseline install --overwrite     # replace existing defaults

# List available built-in task types
evalloop defaults
```

---

## Scoring

evalloop uses **consistency-first scoring**: deterministic heuristics run first, then the best available semantic backend.

| Signal | What it catches |
|--------|-----------------|
| Empty / whitespace output | Total failures |
| Too short (< 15% of baseline length) | Truncation |
| Too long (> 5x baseline length) | Rambling, prompt injection |
| Cosine similarity to baseline centroid | Semantic drift (Voyage backend) |
| LLM-as-judge rating | Quality regression (Anthropic backend) |
| `degraded_mode` flag | No scoring backend available — heuristics only |

The backend is auto-detected from your environment. Run `evalloop init` to configure.

Scores range from 0.0 to 1.0. A 24-hour average more than 5pp below your 7-day average triggers a regression flag.
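The regression check is simple arithmetic on the two averages shown by `evalloop watch`. Function names here are assumed for illustration:

```python
def drop_pp(avg_7d, avg_24h):
    """Drop between the 7-day and 24-hour averages, in percentage points."""
    return round((avg_7d - avg_24h) * 100, 1)

def is_regression(avg_7d, avg_24h, threshold_pp=5.0):
    """True when the recent average has fallen more than threshold_pp points."""
    return drop_pp(avg_7d, avg_24h) > threshold_pp

# The watch example above: 0.81 → 0.71 is a 10.0pp drop, over the 5pp threshold.
```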

---

## Baselines

Baselines are your known-good outputs. evalloop ships with curated defaults for common task types — you get meaningful scores immediately on install.

```bash
# Install defaults for all task types
evalloop baseline install

# Add your own known-good output
evalloop baseline add "The capital of France is Paris." --tag qa

# List all tags with baselines
evalloop baseline list
```

Built-in task types: `qa`, `summarization`, `code`, `customer-service`, `classification`

---

## OpenAI support

```python
from evalloop import wrap
import openai

# pip install "evalloop[openai]"
client = wrap(openai.OpenAI(), task_tag="summarization")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You summarize documents."},
        {"role": "user", "content": "Summarize: ..."},
    ],
)
```

---

## Data & Privacy

evalloop stores the following locally in `~/.evalloop/calls.db`:

| Field | Stored by default | With `store_inputs=False` |
|-------|:-----------------:|:-------------------------:|
| Timestamp, model, latency | ✅ | ✅ |
| Output text | ✅ | ✅ |
| Score, flags, confidence | ✅ | ✅ |
| Input messages (may contain PII) | ✅ | ❌ |

For PII-sensitive environments (HIPAA, GDPR), opt out of input storage:

```python
client = wrap(anthropic.Anthropic(), task_tag="support", store_inputs=False)
# Input messages are never written to disk. Output + scores are still captured.
```

All data stays on your machine. evalloop has no cloud component.

---

## Export & share

Share your eval data with your team:

```bash
# Export last 1000 calls as JSON
evalloop export > eval-report.json

# Export as CSV for Excel/Sheets
evalloop export --format csv -o eval-report.csv

# Export a specific tag
evalloop export --tag qa --limit 500
```

---

## Architecture

- **Storage**: SQLite at `~/.evalloop/calls.db` (WAL mode — concurrent reads while writing)
- **Baselines**: JSONL files at `~/.evalloop/baselines/<tag>.jsonl`
- **Scoring backends**: Voyage AI `voyage-3-lite` (semantic embeddings) or Claude Haiku (LLM-as-judge) — auto-detected from env vars
- **Score model provenance**: stored per-row so history remains interpretable if the backend changes
- **No cloud dependency** — everything runs locally
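The "cosine similarity to baseline centroid" signal from the Scoring section reduces to a few lines of vector math. A pure-Python sketch with toy 2-d vectors; real embeddings would come from `voyage-3-lite`, and this is not evalloop's code:

```python
import math

def centroid(vectors):
    """Element-wise mean of the baseline embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

baseline_vecs = [[1.0, 0.0], [0.9, 0.1]]   # toy "embeddings" of known-good outputs
output_vec = [0.95, 0.05]                  # toy embedding of a new output
score = cosine(output_vec, centroid(baseline_vecs))  # near 1.0: on-baseline
```

Comparing against the centroid rather than each baseline individually keeps scoring O(1) in the number of baselines per call.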

---

## License

MIT
@@ -0,0 +1,4 @@
from evalloop.capture import wrap
from evalloop.scorer import Score, score

__all__ = ["wrap", "Score", "score"]
@@ -0,0 +1,15 @@
"""
_utils.py — shared internal helpers.
"""

from __future__ import annotations

import sys


def _warn(msg: str) -> None:
    """Write a warning to stderr. Never raises."""
    try:
        print(msg, file=sys.stderr)
    except Exception:  # noqa: BLE001
        pass