vowel-optimization 0.1.0__tar.gz

MIT License

Copyright (c) 2026 Mert

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

---

Metadata-Version: 2.4
Name: vowel-optimization
Version: 0.1.0
Summary: GEPA-powered prompt optimization for vowel eval spec generation
Author-email: Mert <your.email@example.com>
License: MIT
Keywords: llm,testing,optimization,prompt-engineering,GEPA
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: vowel>=0.2.7
Requires-Dist: GEPA
Requires-Dist: pydantic-ai
Requires-Dist: logfire
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# vowel-optimization

**GEPA-powered prompt optimization playground for vowel eval spec generation.**

This repository contains a research tool for optimizing vowel's `EVAL_SPEC_CONTEXT` prompt using [GEPA](https://github.com/GEPA-ai/GEPA) (Genetic Pareto). By iteratively generating evaluation specs and measuring their quality against ground-truth function implementations, it discovers improved prompts that produce better test cases.

---

## Overview

vowel generates YAML evaluation specs from function signatures and descriptions. The quality of these specs depends heavily on the system prompt (`EVAL_SPEC_CONTEXT`) used during generation. This tool:

1. **Evaluates** the current prompt against a set of reference functions
2. **Optimizes** the prompt via GEPA's evolutionary search
3. **Compares** baseline vs. optimized performance

### Debug

Set `LOGFIRE_ENABLED=true` to enable debugging and monitoring:

```bash
export LOGFIRE_ENABLED=true
```

*or*

```bash
echo "LOGFIRE_ENABLED=true" > .env
```

### How It Works

```
EVAL_SPEC_CONTEXT (prompt)
          ↓
Generate YAML evals
          ↓
Run against ground truth
          ↓
Score (pass rate)
          ↓
GEPA: propose improvements
          ↓
[repeat until convergence]
```

Each iteration:
- Generates eval specs for reference functions (e.g., `json_encode`, `slugify`, `levenshtein`)
- Executes the specs against the original implementations
- Diagnoses failures (wrong expected values, invented raises, over-strict assertions)
- Feeds structured failure reports to a proposer LLM to suggest prompt refinements

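The generate → score → propose loop can be sketched as follows (the helper names are hypothetical stand-ins; the real implementations live in `task.py` and `adapter.py`):

```python
# Minimal sketch of one GEPA-style iteration with stubbed-out LLM calls.

def generate_specs(prompt: str, functions: dict) -> dict:
    """Stand-in for the LLM call that turns a prompt into YAML eval specs."""
    return {name: f"spec for {name}" for name in functions}

def score_specs(specs: dict, functions: dict) -> float:
    """Stand-in for running specs against the ground-truth implementations."""
    passed = sum(1 for name in specs if name in functions)
    return passed / max(len(specs), 1)

def propose_improvement(prompt: str, score: float) -> str:
    """Stand-in for the proposer LLM's reflection step."""
    return f"{prompt} (refined at {score:.0%})"

def optimize(seed_prompt: str, functions: dict, max_calls: int = 3) -> tuple[str, float]:
    best_prompt, best_score = seed_prompt, 0.0
    candidate = seed_prompt
    for _ in range(max_calls):
        score = score_specs(generate_specs(candidate, functions), functions)
        if score > best_score:  # keep the best candidate seen so far
            best_prompt, best_score = candidate, score
        candidate = propose_improvement(candidate, score)
    return best_prompt, best_score
```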
---

## Installation

### From source

```bash
git clone <this-repo>
cd vowel-optimization
pip install -e .
```

### Dependencies

- Python 3.11+
- vowel
- GEPA
- pydantic-ai
- logfire (for telemetry)

---

## Quick Start

### 1. Evaluate Current Prompt

Test the default `EVAL_SPEC_CONTEXT` against all reference functions:

```bash
python -m vowel_optimization eval
```

**Output:**
```
============================================================
Evaluating [current] context (2453 chars)
============================================================

json_encode... 88% (15/17)
slugify... 92% (23/25)
levenshtein... 100% (12/12)
...

------------------------------------------------------------
Average pass rate: 91%
Total: 143/157 cases passed

Failure breakdown:
WRONG_EXPECTED: 8
INVENTED_RAISES: 4
FORMAT_MISMATCH: 2
```

### 2. Run Optimization

Use GEPA to improve the prompt over 50 metric evaluations:

```bash
python -m vowel_optimization optimize --max-calls 50
```

**What happens:**
- GEPA starts with `EVAL_SPEC_CONTEXT` as the seed
- Evaluates it on the reference functions
- Proposes improved versions via LLM reflection
- Saves the best candidate to `optimized_context.txt`

**Options:**
- `--max-calls N`: Maximum GEPA metric calls (default: 50)
- `--output path/to/file.txt`: Save location for the optimized prompt
- `--model MODEL`: Override the eval model (default: `openrouter:google/gemini-3-flash-preview`)
- `--proposer-model MODEL`: Override the proposer model
- `--seed-file path/to/seed.txt`: Start from a previously optimized prompt instead of the default

**Example:**
```bash
# Start from previous best, run 100 more iterations
python -m vowel_optimization optimize \
    --max-calls 100 \
    --seed-file optimized_context.txt \
    --output optimized_context_v2.txt
```

### 3. Compare Results

Compare prompts against each other or against the default:

```bash
# Compare one file vs default EVAL_SPEC_CONTEXT
python -m vowel_optimization compare optimized_context.txt

# Compare two specific files
python -m vowel_optimization compare optimized_context.txt optimized_context_v3.txt

# Compare multiple files
python -m vowel_optimization compare v1.txt v2.txt v3.txt
```

**Output (1 file vs default):**
```
1. Current EVAL_SPEC_CONTEXT:
============================================================
Average pass rate: 91%
...

2. optimized_context.txt:
============================================================
Average pass rate: 96%
...

============================================================
Comparison:
EVAL_SPEC_CONTEXT: 91%
optimized_context.txt: 96%
Δ: +5%
```

**Output (2+ files):**
```
1. optimized_context.txt:
============================================================
Average pass rate: 85%
...

2. optimized_context_v3.txt:
============================================================
Average pass rate: 99%
...

============================================================
Comparison:
optimized_context.txt: 85%
optimized_context_v3.txt: 99%
Δ: +14%
```

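The Δ footer is just the difference between the first and last average pass rates. A small sketch of that formatting (a hypothetical helper, not the CLI's actual code):

```python
def format_comparison(avg_pass_rates: dict[str, float]) -> str:
    """Render a 'Comparison:' footer like the CLI output above (sketch)."""
    items = list(avg_pass_rates.items())  # insertion order = comparison order
    lines = ["Comparison:"] + [f"{name}: {rate:.0%}" for name, rate in items]
    delta = items[-1][1] - items[0][1]    # last file minus first file
    lines.append(f"Δ: {delta:+.0%}")
    return "\n".join(lines)
```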
### 4. Evaluate Custom Prompts

Test any saved prompt file:

```bash
python -m vowel_optimization eval --context-file my_experiments/prompt_v3.txt
```

---

## Changing Models

All models use the format `provider:model-name`. Default: `openrouter:google/gemini-3-flash-preview`.

### Eval Model

Controls the LLM that generates eval specs during evaluation and optimization:

```bash
python -m vowel_optimization eval --model openrouter:anthropic/claude-3.5-sonnet
```

```bash
python -m vowel_optimization optimize --model openai:gpt-4o
```

### Proposer Model

Controls the LLM that suggests prompt improvements (GEPA's reflection step):

```bash
python -m vowel_optimization optimize \
    --model openrouter:google/gemini-3-flash-preview \
    --proposer-model openrouter:anthropic/claude-3.5-sonnet
```

**Tip:** Use a stronger model for the proposer (e.g., Claude Sonnet) and a faster, cheaper model for eval generation (e.g., Gemini Flash).

---

## Project Structure

```
vowel_optimization/
├── .git/
├── .gitignore
├── LICENSE
├── README.md
├── pyproject.toml
├── results/                     # Optimization outputs
└── src/
    └── vowel_optimization/
        ├── __init__.py
        ├── __main__.py          # CLI entry point
        ├── run_optimization.py  # Main commands (eval, optimize, compare)
        ├── adapter.py           # GEPA adapter implementation
        ├── task.py              # Core: generate + score eval specs
        ├── functions.py         # Reference function wrappers
        └── definitions.py       # Ground truth implementations
```

### Key Files

- **run_optimization.py**: CLI commands and orchestration
- **adapter.py**: Bridges GEPA's optimization loop with vowel's eval generation
- **task.py**: Implements `generate_and_score()` — generates YAML, runs tests, diagnoses failures
- **definitions.py**: Reference functions (`json_encode`, `slugify`, `levenshtein`, etc.)

---

## How Optimization Works

### GEPA Adapter

The `VowelGEPAAdapter` implements three key methods:

1. **evaluate()**: Generate eval specs for each function → run them → return pass rates
2. **make_reflective_dataset()**: Convert failures into structured feedback (e.g., "Expected value wrong for case X")
3. **propose_new_texts()**: Call the proposer LLM with failure diagnostics → get an improved prompt

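A skeletal sketch of that adapter shape (illustrative only; GEPA's actual adapter interface and these method signatures may differ):

```python
# Illustrative adapter skeleton with stubbed LLM calls; not the real implementation.

class VowelGEPAAdapterSketch:
    def __init__(self, functions: dict):
        self.functions = functions

    def evaluate(self, candidate_prompt: str) -> dict[str, float]:
        """Generate specs with the candidate prompt and return per-function pass rates."""
        # Stub: a real implementation would call the eval LLM and execute the specs.
        return {name: 1.0 for name in self.functions}

    def make_reflective_dataset(self, results: dict[str, float]) -> list[str]:
        """Turn failing functions into structured feedback strings for the proposer."""
        return [f"{name}: pass rate {rate:.0%}" for name, rate in results.items() if rate < 1.0]

    def propose_new_texts(self, prompt: str, feedback: list[str]) -> str:
        """Stub for the proposer-LLM call that rewrites the prompt."""
        return prompt + "\n# Address: " + "; ".join(feedback) if feedback else prompt
```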
### Failure Categories

`task.py` classifies failures into:

- **WRONG_EXPECTED**: Generated expected value doesn't match the actual output
- **INVENTED_RAISES**: Spec expects an exception but the function returns normally
- **FORMAT_MISMATCH**: String formatting details are wrong (separators, whitespace)
- **OVER_STRICT_ASSERTION**: Assertion too brittle (e.g., exact float equality)
- **BAD_INPUT**: Invalid input in a generated test case

These drive prompt refinements like:
- "Add guidance to trace algorithm logic for expected values"
- "Only test raises for code paths that actually throw"
- "Use lenient type checking for bool/int compatibility"

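A toy classifier illustrating how failed cases might be bucketed (hypothetical heuristics, not `task.py`'s actual logic):

```python
def classify_failure(expected, actual, spec_expects_raise: bool = False,
                     actually_raised: bool = False) -> str:
    """Map a failed eval case to one of the failure categories above (toy heuristics)."""
    if spec_expects_raise and not actually_raised:
        # The spec invented an exception the function never throws.
        return "INVENTED_RAISES"
    if isinstance(expected, str) and isinstance(actual, str) and expected.split() == actual.split():
        # Same tokens, different whitespace/separators → formatting issue.
        return "FORMAT_MISMATCH"
    if isinstance(expected, float) and isinstance(actual, float) and abs(expected - actual) < 1e-9:
        # Values agree to within tolerance; the assertion was too exact.
        return "OVER_STRICT_ASSERTION"
    return "WRONG_EXPECTED"
```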
---

## Telemetry

Uses [Logfire](https://logfire.dev/) for observability. Logs:

- Each evaluation run with scores and failure breakdown
- GEPA optimization progress (candidates, scores, best context)
- Individual function case results

Set the `LOGFIRE_TOKEN` env var to enable cloud logging.

---

## Adding Reference Functions

Edit `definitions.py`:

```python
# definitions.py
def my_new_function(x: int) -> int:
    """Double the input."""
    return x * 2

# Add to the FUNCTIONS dict at the bottom
FUNCTIONS = {
    # ... existing functions ...
    "my_new_function": {
        "func": my_new_function,
        "description": "Doubles integer input",
    },
}
```

The optimizer will automatically include it in the next run.

---

## Examples

### Baseline Evaluation

```bash
$ python -m vowel_optimization eval

============================================================
Evaluating [current] context (2453 chars)
============================================================

json_encode... 88% (15/17)
slugify... 92% (23/25)
levenshtein... 100% (12/12)

------------------------------------------------------------
Average pass rate: 91%
Total: 143/157 cases passed
```

### Full Optimization Run

```bash
$ python -m vowel_optimization optimize --max-calls 30

Starting GEPA prompt optimization...
Eval model: openrouter:google/gemini-3-flash-preview
Proposer model: openrouter:google/gemini-3-flash-preview
Max metric calls: 30
Seed: default EVAL_SPEC_CONTEXT (2453 chars)
============================================================

[GEPA progress bar with candidate evaluation]

============================================================
Optimization Complete!
============================================================
Best validation score: 96.18%
Context length: 3127 chars
Saved to: optimized_context.txt
```

### Iterative Refinement

```bash
# First round
python -m vowel_optimization optimize --max-calls 50 --output opt_v1.txt

# Second round starting from v1
python -m vowel_optimization optimize --max-calls 50 \
    --seed-file opt_v1.txt \
    --output opt_v2.txt

# Compare all versions
python -m vowel_optimization compare opt_v1.txt            # vs default
python -m vowel_optimization compare opt_v2.txt            # vs default
python -m vowel_optimization compare opt_v1.txt opt_v2.txt # direct comparison
```

---

## License

MIT

---

## Related

- **[vowel](https://github.com/fswair/vowel)**: YAML-based evaluation framework
- **[GEPA](https://github.com/GEPA-ai/GEPA)**: Genetic Pareto for meta-optimization
- **[pydantic-ai](https://github.com/pydantic/pydantic-ai)**: Type-safe AI agent framework

---

## Reference

Used [@dmontagu](https://github.com/dmontagu)'s [pydantic-ai-GEPA-example](https://github.com/dmontagu/pydantic-ai-GEPA-example) as the seed repository.