vowel-optimization 0.1.0 (tar.gz)
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- vowel_optimization-0.1.0/LICENSE +21 -0
- vowel_optimization-0.1.0/PKG-INFO +424 -0
- vowel_optimization-0.1.0/README.md +398 -0
- vowel_optimization-0.1.0/pyproject.toml +52 -0
- vowel_optimization-0.1.0/setup.cfg +4 -0
- vowel_optimization-0.1.0/src/vowel_optimization/__init__.py +3 -0
- vowel_optimization-0.1.0/src/vowel_optimization/__main__.py +7 -0
- vowel_optimization-0.1.0/src/vowel_optimization/adapter.py +247 -0
- vowel_optimization-0.1.0/src/vowel_optimization/definitions.py +358 -0
- vowel_optimization-0.1.0/src/vowel_optimization/functions.py +43 -0
- vowel_optimization-0.1.0/src/vowel_optimization/run_optimization.py +292 -0
- vowel_optimization-0.1.0/src/vowel_optimization/task.py +589 -0
- vowel_optimization-0.1.0/src/vowel_optimization.egg-info/PKG-INFO +424 -0
- vowel_optimization-0.1.0/src/vowel_optimization.egg-info/SOURCES.txt +16 -0
- vowel_optimization-0.1.0/src/vowel_optimization.egg-info/dependency_links.txt +1 -0
- vowel_optimization-0.1.0/src/vowel_optimization.egg-info/entry_points.txt +2 -0
- vowel_optimization-0.1.0/src/vowel_optimization.egg-info/requires.txt +9 -0
- vowel_optimization-0.1.0/src/vowel_optimization.egg-info/top_level.txt +1 -0
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Mert

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,424 @@
Metadata-Version: 2.4
Name: vowel-optimization
Version: 0.1.0
Summary: GEPA-powered prompt optimization for vowel eval spec generation
Author-email: Mert <your.email@example.com>
License: MIT
Keywords: llm,testing,optimization,prompt-engineering,GEPA
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: vowel>=0.2.7
Requires-Dist: GEPA
Requires-Dist: pydantic-ai
Requires-Dist: logfire
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# vowel-optimization

**GEPA-powered prompt optimization playground for vowel eval spec generation.**

This repository contains a research tool for optimizing vowel's `EVAL_SPEC_CONTEXT` prompt using [GEPA](https://github.com/GEPA-ai/GEPA) (Genetic Pareto). By iteratively generating evaluation specs and measuring their quality against ground-truth function implementations, it discovers improved prompts that produce better test cases.

---

## Overview

vowel generates YAML evaluation specs from function signatures and descriptions. The quality of these specs depends heavily on the system prompt (`EVAL_SPEC_CONTEXT`) used during generation. This tool:

1. **Evaluates** the current prompt against a set of reference functions
2. **Optimizes** the prompt via GEPA's evolutionary search
3. **Compares** baseline vs. optimized performance

### Debug

Set the `LOGFIRE_ENABLED` environment variable to `true` to enable debugging and monitoring:

```bash
export LOGFIRE_ENABLED=true
```

*or*

```bash
echo "LOGFIRE_ENABLED=true" > .env
```

### How It Works

```
EVAL_SPEC_CONTEXT (prompt)
↓
Generate YAML evals
↓
Run against ground truth
↓
Score (pass rate)
↓
GEPA: propose improvements
↓
[repeat until convergence]
```

Each iteration:
- Generates eval specs for the reference functions (e.g., `json_encode`, `slugify`, `levenshtein`)
- Executes the specs against the original implementations
- Diagnoses failures (wrong expected values, invented raises, over-strict assertions)
- Feeds structured failure reports to a proposer LLM to suggest prompt refinements
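In code, one iteration of this loop amounts to: generate cases, run them against the ground truth, and average the pass rates. The sketch below is purely illustrative; `generate_spec`, `run_spec`, and the case shape are hypothetical stand-ins for the real `task.py` code, not vowel's actual API:

```python
# Sketch of one evaluation pass over the reference functions.
# All names here are hypothetical stand-ins for task.py / adapter.py.

def generate_spec(prompt: str, func_name: str) -> list[dict]:
    """Stand-in for LLM spec generation: returns eval cases for func_name."""
    # The real tool calls an LLM with `prompt`; here we fake two cases.
    return [{"input": "a", "expected": "a"}, {"input": "b", "expected": "B"}]

def run_spec(cases: list[dict], ground_truth) -> float:
    """Execute generated cases against the ground-truth implementation."""
    passed = sum(1 for c in cases if ground_truth(c["input"]) == c["expected"])
    return passed / len(cases)

def evaluate_prompt(prompt: str, functions: dict) -> float:
    """Average the per-function pass rates, as the `eval` command reports."""
    rates = [run_spec(generate_spec(prompt, name), fn)
             for name, fn in functions.items()]
    return sum(rates) / len(rates)

# Toy ground truth: the identity function passes 1 of the 2 fake cases above.
score = evaluate_prompt("seed prompt", {"identity": lambda x: x})  # → 0.5
```

GEPA's contribution is the outer loop: it mutates `prompt` based on which cases failed and keeps candidates that raise this score.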
---

## Installation

### From source

```bash
git clone <this-repo>
cd vowel-optimization
pip install -e .
```

### Dependencies

- Python 3.11+
- vowel
- GEPA
- pydantic-ai
- logfire (for telemetry)

---

## Quick Start

### 1. Evaluate Current Prompt

Test the default `EVAL_SPEC_CONTEXT` against all reference functions:

```bash
python -m vowel_optimization eval
```

**Output:**
```
============================================================
Evaluating [current] context (2453 chars)
============================================================

json_encode... 88% (15/17)
slugify... 92% (23/25)
levenshtein... 100% (12/12)
...

------------------------------------------------------------
Average pass rate: 91%
Total: 143/157 cases passed

Failure breakdown:
WRONG_EXPECTED: 8
INVENTED_RAISES: 4
FORMAT_MISMATCH: 2
```
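Note that the report prints two aggregates that need not match: the unweighted mean of per-function pass rates and the pooled rate over all cases. Which one "Average pass rate" uses is an assumption here; this toy calculation just shows why the two can differ:

```python
# Two ways to aggregate the sample output above: the unweighted mean of
# per-function pass rates vs. the pooled pass rate over all cases.

results = {            # function -> (passed, total), as in the sample output
    "json_encode": (15, 17),
    "slugify": (23, 25),
    "levenshtein": (12, 12),
}

mean_of_rates = sum(p / t for p, t in results.values()) / len(results)
passed = sum(p for p, _ in results.values())   # 50
total = sum(t for _, t in results.values())    # 54
pooled_rate = passed / total

# The mean of rates weights small functions (levenshtein's 12 cases) as
# heavily as large ones, so it lands slightly above the pooled rate here.
```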
### 2. Run Optimization

Use GEPA to improve the prompt over 50 metric evaluations:

```bash
python -m vowel_optimization optimize --max-calls 50
```

**What happens:**
- GEPA starts with `EVAL_SPEC_CONTEXT` as the seed
- Evaluates it on the reference functions
- Proposes improved versions via LLM reflection
- Saves the best candidate to `optimized_context.txt`

**Options:**
- `--max-calls N`: Maximum GEPA metric calls (default: 50)
- `--output path/to/file.txt`: Save location for the optimized prompt
- `--model MODEL`: Override the eval model (default: `openrouter:google/gemini-3-flash-preview`)
- `--proposer-model MODEL`: Override the proposer model
- `--seed-file path/to/seed.txt`: Start from a previously optimized prompt instead of the default

**Example:**
```bash
# Start from the previous best, run 100 more iterations
python -m vowel_optimization optimize \
    --max-calls 100 \
    --seed-file optimized_context.txt \
    --output optimized_context_v2.txt
```

### 3. Compare Results

Compare prompts against each other or against the default:

```bash
# Compare one file vs. the default EVAL_SPEC_CONTEXT
python -m vowel_optimization compare optimized_context.txt

# Compare two specific files
python -m vowel_optimization compare optimized_context.txt optimized_context_v3.txt

# Compare multiple files
python -m vowel_optimization compare v1.txt v2.txt v3.txt
```
**Output (1 file vs. default):**
```
1. Current EVAL_SPEC_CONTEXT:
============================================================
Average pass rate: 91%
...

2. optimized_context.txt:
============================================================
Average pass rate: 96%
...

============================================================
Comparison:
EVAL_SPEC_CONTEXT: 91%
optimized_context.txt: 96%
Δ: +5%
```

**Output (2+ files):**
```
1. optimized_context.txt:
============================================================
Average pass rate: 85%
...

2. optimized_context_v3.txt:
============================================================
Average pass rate: 99%
...

============================================================
Comparison:
optimized_context.txt: 85%
optimized_context_v3.txt: 99%
Δ: +14%
```
### 4. Evaluate Custom Prompts

Test any saved prompt file:

```bash
python -m vowel_optimization eval --context-file my_experiments/prompt_v3.txt
```

---
## Changing Models

All models use the format `provider:model-name`. Default: `openrouter:google/gemini-3-flash-preview`.

### Eval Model

Controls the LLM that generates eval specs during evaluation and optimization:

```bash
python -m vowel_optimization eval --model openrouter:anthropic/claude-3.5-sonnet
```

```bash
python -m vowel_optimization optimize --model openai:gpt-4o
```

### Proposer Model

Controls the LLM that suggests prompt improvements (GEPA's reflection step):

```bash
python -m vowel_optimization optimize \
    --model openrouter:google/gemini-3-flash-preview \
    --proposer-model openrouter:anthropic/claude-3.5-sonnet
```

**Tip:** Use a stronger model for the proposer (e.g., Claude Sonnet) and a faster, cheaper model for eval generation (e.g., Gemini Flash).

---
## Project Structure

```
vowel_optimization/
├── .gitignore
├── LICENSE
├── README.md
├── pyproject.toml
├── results/                      # Optimization outputs
└── src/
    └── vowel_optimization/
        ├── __init__.py
        ├── __main__.py           # CLI entry point
        ├── run_optimization.py   # Main commands (eval, optimize, compare)
        ├── adapter.py            # GEPA adapter implementation
        ├── task.py               # Core: generate + score eval specs
        ├── functions.py          # Reference function wrappers
        └── definitions.py        # Ground truth implementations
```

### Key Files

- **run_optimization.py**: CLI commands and orchestration
- **adapter.py**: Bridges GEPA's optimization loop with vowel's eval generation
- **task.py**: Implements `generate_and_score()`, which generates the YAML, runs the tests, and diagnoses failures
- **definitions.py**: Reference functions (`json_encode`, `slugify`, `levenshtein`, etc.)

---
## How Optimization Works

### GEPA Adapter

The `VowelGEPAAdapter` implements three key methods:

1. **evaluate()**: Generate eval specs for each function → run them → return pass rates
2. **make_reflective_dataset()**: Convert failures into structured feedback (e.g., "Expected value wrong for case X")
3. **propose_new_texts()**: Call the proposer LLM with failure diagnostics → get an improved prompt
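The shape of such an adapter can be sketched as below. The method names follow the list above, but the signatures and return types are illustrative assumptions, not GEPA's exact interface:

```python
# Minimal shape of a GEPA-style adapter. Signatures are illustrative
# assumptions based on the three methods listed above.

class VowelGEPAAdapterSketch:
    def __init__(self, functions: dict):
        self.functions = functions  # name -> ground-truth callable

    def evaluate(self, candidate_prompt: str) -> dict:
        """Generate + run eval specs per function; return pass rates."""
        # The real adapter generates specs with an LLM; stubbed to all-pass.
        return {name: 1.0 for name in self.functions}

    def make_reflective_dataset(self, failures: list[dict]) -> list[str]:
        """Turn raw failure records into textual feedback for the proposer."""
        return [f"{f['category']}: {f['function']} case {f['case']}"
                for f in failures]

    def propose_new_texts(self, prompt: str, feedback: list[str]) -> str:
        """Ask the proposer LLM for a refined prompt (stubbed here)."""
        return prompt + f"\n# refined from {len(feedback)} feedback items"
```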
### Failure Categories

task.py classifies failures into:

- **WRONG_EXPECTED**: The generated expected value doesn't match the actual output
- **INVENTED_RAISES**: The spec expects an exception, but the function returns normally
- **FORMAT_MISMATCH**: String formatting details are wrong (separators, whitespace)
- **OVER_STRICT_ASSERTION**: The assertion is too brittle (e.g., exact float equality)
- **BAD_INPUT**: Invalid input in a generated test case

These categories drive prompt refinements like:
- "Add guidance to trace algorithm logic for expected values"
- "Only test raises for code paths that actually throw"
- "Use lenient type checking for bool/int compatibility"
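A classifier in this spirit might look like the following. This is a deliberate simplification covering three of the five categories; task.py's actual diagnostics are richer:

```python
# Simplified failure classifier in the spirit of task.py's categories
# (only three of the five are modeled here).

def classify_failure(expected, expected_raises, actual, actual_raises):
    """Map one failed eval case to a coarse failure category."""
    if expected_raises is not None and actual_raises is None:
        # The spec predicted an exception the ground truth never throws.
        return "INVENTED_RAISES"
    if isinstance(expected, str) and isinstance(actual, str):
        if expected.split() == actual.split():
            # Same tokens, different whitespace/separators.
            return "FORMAT_MISMATCH"
    # Otherwise the expected value is simply wrong.
    return "WRONG_EXPECTED"
```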
---

## Telemetry

Uses [Logfire](https://logfire.dev/) for observability. Logs:
- Each evaluation run, with scores and a failure breakdown
- GEPA optimization progress (candidates, scores, best context)
- Individual function case results

Set the `LOGFIRE_TOKEN` environment variable to enable cloud logging.

---
## Adding Reference Functions

Edit `definitions.py`:

```python
# definitions.py
def my_new_function(x: int) -> int:
    """Double the input."""
    return x * 2

# Add it to the FUNCTIONS dict at the bottom
FUNCTIONS = {
    # ... existing functions ...
    "my_new_function": {
        "func": my_new_function,
        "description": "Doubles integer input",
    },
}
```

The optimizer will automatically include it in the next run.
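A registry of this shape is typically consumed by pairing each entry's signature with its description when prompting the spec generator. The loop below is a sketch; the real consumption code lives in run_optimization.py/task.py and its details are assumed here:

```python
# Sketch: how a FUNCTIONS-style registry can feed the spec generator.
# The real loop lives elsewhere in the package; details here are assumed.
import inspect

def my_new_function(x: int) -> int:
    """Double the input."""
    return x * 2

FUNCTIONS = {
    "my_new_function": {"func": my_new_function,
                        "description": "Doubles integer input"},
}

# Each entry yields a signature + description line an LLM can be prompted with.
prompts = [
    f"{name}{inspect.signature(entry['func'])}: {entry['description']}"
    for name, entry in FUNCTIONS.items()
]
# prompts[0] == "my_new_function(x: int) -> int: Doubles integer input"
```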
---

## Examples

### Baseline Evaluation

```bash
$ python -m vowel_optimization eval

============================================================
Evaluating [current] context (2453 chars)
============================================================

json_encode... 88% (15/17)
slugify... 92% (23/25)
levenshtein... 100% (12/12)

------------------------------------------------------------
Average pass rate: 91%
Total: 143/157 cases passed
```

### Full Optimization Run

```bash
$ python -m vowel_optimization optimize --max-calls 30

Starting GEPA prompt optimization...
Eval model: openrouter:google/gemini-3-flash-preview
Proposer model: openrouter:google/gemini-3-flash-preview
Max metric calls: 30
Seed: default EVAL_SPEC_CONTEXT (2453 chars)
============================================================

[GEPA progress bar with candidate evaluation]

============================================================
Optimization Complete!
============================================================
Best validation score: 96.18%
Context length: 3127 chars
Saved to: optimized_context.txt
```

### Iterative Refinement

```bash
# First round
python -m vowel_optimization optimize --max-calls 50 --output opt_v1.txt

# Second round, starting from v1
python -m vowel_optimization optimize --max-calls 50 \
    --seed-file opt_v1.txt \
    --output opt_v2.txt

# Compare all versions
python -m vowel_optimization compare opt_v1.txt            # vs. default
python -m vowel_optimization compare opt_v2.txt            # vs. default
python -m vowel_optimization compare opt_v1.txt opt_v2.txt # direct comparison
```
---
|
|
407
|
+
|
|
408
|
+
## License
|
|
409
|
+
|
|
410
|
+
MIT
|
|
411
|
+
|
|
412
|
+
---
|
|
413
|
+
|
|
414
|
+
## Related
|
|
415
|
+
|
|
416
|
+
- **[vowel](https://github.com/fswair/vowel)**: YAML-based evaluation framework
|
|
417
|
+
- **[GEPA](https://github.com/GEPA-ai/GEPA)**: Genetic Pareto for meta-optimization
|
|
418
|
+
- **[pydantic-ai](https://github.com/pydantic/pydantic-ai)**: Type-safe AI agent framework
|
|
419
|
+
|
|
420
|
+
---
|
|
421
|
+
|
|
422
|
+
## Reference
|
|
423
|
+
|
|
424
|
+
Used [@dmontagu](https://github.com/dmontagu)'s [pydantic-ai-GEPA-example](https://github.com/dmontagu/pydantic-ai-GEPA-example) as seed repository.
|