weco 0.2.8__tar.gz → 0.2.9__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {weco-0.2.8 → weco-0.2.9}/.github/workflows/release.yml +2 -2
- {weco-0.2.8 → weco-0.2.9}/PKG-INFO +4 -3
- {weco-0.2.8 → weco-0.2.9}/README.md +3 -2
- weco-0.2.9/examples/metal/README.md +39 -0
- weco-0.2.9/examples/prompt/README.md +100 -0
- weco-0.2.9/examples/prompt/eval.py +135 -0
- weco-0.2.9/examples/prompt/optimize.py +34 -0
- weco-0.2.9/examples/prompt/prompt_guide.md +45 -0
- weco-0.2.9/examples/triton/README.md +38 -0
- {weco-0.2.8 → weco-0.2.9}/pyproject.toml +1 -1
- {weco-0.2.8 → weco-0.2.9}/weco/__init__.py +1 -1
- {weco-0.2.8 → weco-0.2.9}/weco/cli.py +27 -10
- {weco-0.2.8 → weco-0.2.9}/weco.egg-info/PKG-INFO +4 -3
- {weco-0.2.8 → weco-0.2.9}/weco.egg-info/SOURCES.txt +4 -0
- weco-0.2.8/examples/metal/README.md +0 -0
- weco-0.2.8/examples/triton/README.md +0 -0
- {weco-0.2.8 → weco-0.2.9}/.github/workflows/lint.yml +0 -0
- {weco-0.2.8 → weco-0.2.9}/.gitignore +0 -0
- {weco-0.2.8 → weco-0.2.9}/LICENSE +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/cuda/README.md +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/cuda/evaluate.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/cuda/guide.md +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/cuda/optimize.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/hello-kernel-world/evaluate.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/hello-kernel-world/optimize.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/metal/evaluate.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/metal/examples.rst +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/metal/optimize.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/spaceship-titanic/README.md +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/spaceship-titanic/baseline.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/spaceship-titanic/evaluate.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/spaceship-titanic/optimize.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/spaceship-titanic/requirements-test.txt +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/spaceship-titanic/utils.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/triton/evaluate.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/examples/triton/optimize.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/setup.cfg +0 -0
- {weco-0.2.8 → weco-0.2.9}/weco/api.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/weco/panels.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/weco/utils.py +0 -0
- {weco-0.2.8 → weco-0.2.9}/weco.egg-info/dependency_links.txt +0 -0
- {weco-0.2.8 → weco-0.2.9}/weco.egg-info/entry_points.txt +0 -0
- {weco-0.2.8 → weco-0.2.9}/weco.egg-info/requires.txt +0 -0
- {weco-0.2.8 → weco-0.2.9}/weco.egg-info/top_level.txt +0 -0
{weco-0.2.8 → weco-0.2.9}/.github/workflows/release.yml

@@ -90,7 +90,7 @@ jobs:
 GITHUB_TOKEN: ${{ github.token }}
 run: >-
 gh release create
-'v0.2.8'
+'v0.2.9'
 --repo '${{ github.repository }}'
 --notes ""
 

@@ -102,5 +102,5 @@ jobs:
 # sigstore-produced signatures and certificates.
 run: >-
 gh release upload
-'v0.2.8' dist/**
+'v0.2.9' dist/**
 --repo '${{ github.repository }}'
{weco-0.2.8 → weco-0.2.9}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: weco
-Version: 0.2.8
+Version: 0.2.9
 Summary: Documentation for `weco`, a CLI for using Weco AI's code optimizer.
 Author-email: Weco AI Team <contact@weco.ai>
 License: MIT

@@ -76,7 +76,7 @@ The `weco` CLI leverages a tree search approach guided by Large Language Models
 
 This basic example shows how to optimize a simple PyTorch function for speedup.
 
-For more advanced examples, including **[Metal/MLX](/examples/metal/README.md), [Triton](/examples/triton/README.md), [CUDA kernel optimization](/examples/cuda/README.md)**, and **[ML model optimization](/examples/spaceship-titanic/README.md)
+For more advanced examples, including **[Metal/MLX](/examples/metal/README.md), [Triton](/examples/triton/README.md), [CUDA kernel optimization](/examples/cuda/README.md)**, and **[ML model optimization](/examples/spaceship-titanic/README.md)**, please see the `README.md` files within the corresponding subdirectories under the [`examples/`](./examples/) folder.
 
 ```bash
 # Navigate to the example directory

@@ -108,9 +108,10 @@ weco --source optimize.py \
 | `--metric` | The name of the metric you want to optimize (e.g., 'accuracy', 'speedup', 'loss'). This metric name should match what's printed by your `--eval-command`. | Yes |
 | `--maximize` | Whether to maximize (`true`) or minimize (`false`) the metric. | Yes |
 | `--steps` | Number of optimization steps (LLM iterations) to run. | Yes |
-| `--model` | Model identifier for the LLM to use (e.g., `gpt-4o`, `claude-3.
+| `--model` | Model identifier for the LLM to use (e.g., `gpt-4o`, `claude-3.7-sonnet`). Recommended models to try include `o4-mini`, and `gemini-2.5-pro-exp-03-25`.| Yes |
 | `--additional-instructions` | (Optional) Natural language description of specific instructions OR path to a file containing detailed instructions to guide the LLM. | No |
 | `--log-dir` | (Optional) Path to the directory to log intermediate steps and final optimization result. Defaults to `.runs/`. | No |
+| `--preserve-source` | (Optional) If set, do not overwrite the original `--source` file. Modifications and the best solution will still be saved in the `--log-dir`. | No |
 
 ---
 
{weco-0.2.8 → weco-0.2.9}/README.md

@@ -54,7 +54,7 @@ The `weco` CLI leverages a tree search approach guided by Large Language Models
 
 This basic example shows how to optimize a simple PyTorch function for speedup.
 
-For more advanced examples, including **[Metal/MLX](/examples/metal/README.md), [Triton](/examples/triton/README.md), [CUDA kernel optimization](/examples/cuda/README.md)**, and **[ML model optimization](/examples/spaceship-titanic/README.md)
+For more advanced examples, including **[Metal/MLX](/examples/metal/README.md), [Triton](/examples/triton/README.md), [CUDA kernel optimization](/examples/cuda/README.md)**, and **[ML model optimization](/examples/spaceship-titanic/README.md)**, please see the `README.md` files within the corresponding subdirectories under the [`examples/`](./examples/) folder.
 
 ```bash
 # Navigate to the example directory

@@ -86,9 +86,10 @@ weco --source optimize.py \
 | `--metric` | The name of the metric you want to optimize (e.g., 'accuracy', 'speedup', 'loss'). This metric name should match what's printed by your `--eval-command`. | Yes |
 | `--maximize` | Whether to maximize (`true`) or minimize (`false`) the metric. | Yes |
 | `--steps` | Number of optimization steps (LLM iterations) to run. | Yes |
-| `--model` | Model identifier for the LLM to use (e.g., `gpt-4o`, `claude-3.
+| `--model` | Model identifier for the LLM to use (e.g., `gpt-4o`, `claude-3.7-sonnet`). Recommended models to try include `o4-mini`, and `gemini-2.5-pro-exp-03-25`.| Yes |
 | `--additional-instructions` | (Optional) Natural language description of specific instructions OR path to a file containing detailed instructions to guide the LLM. | No |
 | `--log-dir` | (Optional) Path to the directory to log intermediate steps and final optimization result. Defaults to `.runs/`. | No |
+| `--preserve-source` | (Optional) If set, do not overwrite the original `--source` file. Modifications and the best solution will still be saved in the `--log-dir`. | No |
 
 ---
 
weco-0.2.9/examples/metal/README.md (new file)

@@ -0,0 +1,39 @@
+# Example: Optimizing MLX Convolution with Metal
+
+This example demonstrates how to use Weco to optimize a 2D convolution operation implemented in [`mlx`](https://github.com/ml-explore/mlx), targeting Apple's [Metal](https://developer.apple.com/documentation/metal/) framework for execution on Apple Silicon GPUs.
+
+It showcases using a separate file (`examples.rst`) to provide detailed context and instructions to the optimizing LLM.
+
+## Setup
+
+1. Ensure you are in the `examples/metal` directory.
+2. Install the required dependency:
+   ```bash
+   pip install mlx
+   ```
+
+## Optimization Command
+
+Run the following command to start the optimization process:
+
+```bash
+weco --source optimize.py \
+--eval-command "python evaluate.py --solution-path optimize.py" \
+--metric speedup \
+--maximize true \
+--steps 30 \
+--model gemini-2.5-pro-exp-03-25 \
+--additional-instructions examples.rst
+```
+
+### Explanation
+
+* `--source optimize.py`: Specifies the Python file containing the MLX convolution code to be optimized.
+* `--eval-command "python evaluate.py --solution-path optimize.py"`: Runs the evaluation script. `evaluate.py` executes the code in `optimize.py`, measures its performance against a baseline, and prints the `speedup` metric.
+* `--metric speedup`: Tells Weco to target the 'speedup' value printed by the evaluation command.
+* `--maximize true`: Instructs Weco to aim for a higher speedup value.
+* `--steps 30`: Defines the number of iterative optimization steps Weco will perform.
+* `--model gemini-2.5-pro-exp-03-25`: Selects the LLM used for proposing code modifications.
+* `--additional-instructions examples.rst`: Provides a path to a file containing detailed guidance for the LLM during optimization (e.g., constraints, preferred Metal techniques).
+
+Weco will iteratively modify `optimize.py`, run `evaluate.py`, parse the `speedup`, and generate new code versions based on the results and the instructions in `examples.rst`.
weco-0.2.9/examples/prompt/README.md (new file)

@@ -0,0 +1,100 @@
+# weco-cli/examples/prompt/README.md
+# AIME Prompt Engineering Example with Weco
+
+This example shows how **Weco** can iteratively improve a prompt for solving American Invitational Mathematics Examination (AIME) problems. The experiment runs locally, requires only two short Python files, and aims to improve the accuracy metric.
+
+This example uses `gpt-4o-mini` via the OpenAI API by default. Ensure your `OPENAI_API_KEY` environment variable is set.
+
+## Files in this folder
+
+| File | Purpose |
+| :------------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `optimize.py` | Holds the prompt template (instructing the LLM to reason step-by-step and use `\\boxed{}` for the final answer) and the mutable `EXTRA_INSTRUCTIONS` string. Weco edits **only** this file during the search. |
+| `eval.py` | Downloads a small slice of the 2024 AIME dataset, calls `optimize.solve` in parallel, parses the LLM output (looking for `\\boxed{}`), compares it to the ground truth, prints progress logs, and finally prints an `accuracy:` line that Weco reads. |
+
+## Quick start
+
+1. **Clone the repository and enter the folder.**
+   ```bash
+   # If you cloned the main weco-cli repo already:
+   cd examples/prompt
+
+   # Otherwise:
+   # git clone https://github.com/WecoAI/weco-cli.git
+   # cd weco-cli/examples/prompt
+   ```
+2. **Install dependencies.**
+   ```bash
+   # Ensure you have weco installed: pip install weco
+   pip install openai datasets # Add any other dependencies if needed
+   ```
+3. **Set your OpenAI API Key.**
+   ```bash
+   export OPENAI_API_KEY="your_openai_api_key_here"
+   ```
+4. **Run Weco.** The command below iteratively modifies `EXTRA_INSTRUCTIONS` in `optimize.py`, runs `eval.py` to evaluate the prompt's effectiveness, reads the printed accuracy, and keeps the best prompt variations found.
+   ```bash
+   weco --source optimize.py \
+   --eval-command "python eval.py" \
+   --metric accuracy \
+   --maximize true \
+   --steps 40 \
+   --model gemini-2.5-pro-exp-03-25
+   ```
+   *Note: You can replace `--model gemini-2.5-pro-exp-03-25` with another powerful model like `o3` if you have the respective API keys set.*
+
+During each evaluation round, you will see log lines similar to the following:
+
+```text
+[setup] loading 20 problems from AIME 2024 …
+[progress] 5/20 completed, accuracy: 0.0000, elapsed 7.3 s
+[progress] 10/20 completed, accuracy: 0.1000, elapsed 14.6 s
+[progress] 15/20 completed, accuracy: 0.0667, elapsed 21.8 s
+[progress] 20/20 completed, accuracy: 0.0500, elapsed 28.9 s
+accuracy: 0.0500# AIME 2024 Prompt‑Engineering Example
+This example shows how **Weco** can iteratively improve a prompt for solving American Invitational Mathematics Examination (AIME) problems. The experiment runs locally, requires only two short Python files, and finishes in a few hours on a laptop.
+
+## Files in this folder
+
+| File | Purpose |
+| :------------ | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `optimize.py` | Holds the prompt template (instructing the LLM to reason step-by-step and use `\\boxed{}` for the final answer) and the function to call the LLM. Weco edits **only** this file during the search to refine the prompt template. |
+| `eval.py` | Defines the LLM model to use (`MODEL_TO_USE`). Downloads a small slice of the 2024 AIME dataset, calls `optimize.solve` in parallel (passing the chosen model), parses the LLM output, compares it to the ground truth, prints progress logs, and finally prints an `accuracy:` line that Weco reads. |
+
+## Quick start
+
+1. **Clone the repository and enter the folder.**
+   ```bash
+   git clone https://github.com/your‑fork/weco‑examples.git
+   cd weco‑examples/aime‑2024
+   ```
+2. **Run Weco.** The command below edits `EXTRA_INSTRUCTIONS` in `optimize.py`, invokes `eval.py` on every iteration, reads the printed accuracy, and keeps the best variants.
+   ```bash
+   weco --source optimize.py \
+   --eval-command "python eval.py" \
+   --metric accuracy \
+   --maximize true \
+   --steps 40 \
+   --model gemini-2.5-flash-preview-04-17 \
+   --addtional-instructions prompt_guide.md
+   ```
+
+During each evaluation round you will see log lines similar to the following.
+
+```text
+[setup] loading 20 problems from AIME 2024 …
+[progress] 5/20 completed, elapsed 7.3 s
+[progress] 10/20 completed, elapsed 14.6 s
+[progress] 15/20 completed, elapsed 21.8 s
+[progress] 20/20 completed, elapsed 28.9 s
+accuracy: 0.0500
+```
+
+Weco then mutates the config, tries again, and gradually pushes the accuracy higher. On a modern laptop you can usually double the baseline score within thirty to forty iterations.
+
+## How it works
+
+* `eval_aime.py` slices the **Maxwell‑Jia/AIME_2024** dataset to twenty problems for fast feedback. You can change the slice in one line.
+* The script sends model calls in parallel via `ThreadPoolExecutor`, so network latency is hidden.
+* Every five completed items, the script logs progress and elapsed time.
+* The final line `accuracy: value` is the only part Weco needs for guidance.
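For a quick sanity check of this example outside of Weco, the evaluation script can be run on its own. The sketch below is illustrative only (it is not part of the diff) and assumes the dependencies and `OPENAI_API_KEY` from the Quick start above are in place.

```bash
# Run the evaluation once to confirm the metric line Weco parses is produced.
cd examples/prompt
python eval.py   # prints progress lines and ends with a line such as: accuracy: 0.0500
```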
weco-0.2.9/examples/prompt/eval.py (new file)

@@ -0,0 +1,135 @@
+# weco-cli/examples/prompt/eval.py
+"""
+eval.py (parallel with progress logs)
+
+Downloads a slice of AIME 2024, calls optimize.solve in parallel,
+prints progress every N samples, and finally prints accuracy
+in the format that Weco expects.
+The LLM model to use is defined in this file.
+"""
+
+import re
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+import sys
+import concurrent.futures
+
+from datasets import load_dataset
+import optimize # the file Weco mutates
+
+# ---------------------------------------------------------------------
+# Configuration
+TOTAL_SAMPLES = 30 # how many problems to load
+NUM_WORKERS = 30 # concurrent LLM calls
+LOG_EVERY = 5 # print progress after this many
+MODEL_TO_USE = "gpt-4.1" # Define the model to use HERE
+TASK_TIMEOUT = 300 # seconds per LLM call
+# ---------------------------------------------------------------------
+
+print(f"[setup] loading {TOTAL_SAMPLES} problems from AIME 2024 …", flush=True)
+DATA = load_dataset("Maxwell-Jia/AIME_2024", split=f"train[:{TOTAL_SAMPLES}]", cache_dir=".cache")
+
+
+def extract_final_answer(text: str) -> str:
+    """
+    Extracts the final AIME answer (000-999) from the LLM response.
+    Prioritizes answers within \boxed{}, then looks for patterns,
+    and falls back to finding the last 3-digit number.
+    """
+    # 1. Check for \boxed{...}
+    boxed_match = re.search(r"\\boxed\{(\d{1,3})\}", text)
+    if boxed_match:
+        return boxed_match.group(1).zfill(3) # Pad with leading zeros if needed
+
+    # 2. Check for "final answer is ..." patterns (case-insensitive)
+    # Make sure pattern captures potential variations like "is: 123", "is 123."
+    answer_pattern = r"(?:final|answer is|result is)[:\s]*(\d{1,3})\b"
+    answer_match = re.search(answer_pattern, text, re.IGNORECASE)
+    if answer_match:
+        return answer_match.group(1).zfill(3)
+
+    # 3. Fallback: Find the last occurrence of a 1-3 digit number in the text
+    # This is less reliable but can be a fallback.
+    # Let's refine the fallback regex to be slightly more specific
+    # Look for isolated 1-3 digit numbers, possibly at the end or after keywords.
+    fallback_matches = re.findall(r"\b(\d{1,3})\b", text)
+    if fallback_matches:
+        # Return the last found number, assuming it's the most likely answer candidate
+        return fallback_matches[-1].zfill(3)
+
+    return "" # Return empty if no answer found
+
+
+def grade_answer(llm_output: str, ground_truth_answer: str) -> bool:
+    """Compares the extracted LLM answer to the ground truth."""
+    extracted_guess = extract_final_answer(llm_output)
+    # Ground truth answers in AIME are typically strings "000" to "999"
+    # Ensure comparison is consistent (e.g., both as strings, potentially padded)
+    # The ground truth from the dataset seems to be string integers already.
+    # Let's ensure the extracted guess is also treated as a simple integer string for comparison.
+    # The ground truth might not be zero-padded in the dataset, so compare integers.
+    try:
+        # Check if both can be converted to integers for comparison
+        return int(extracted_guess) == int(ground_truth_answer)
+    except ValueError:
+        # If conversion fails (e.g., empty string), they don't match
+        return False
+
+
+def run_evaluation() -> float:
+    """Runs the evaluation on the dataset and returns the accuracy."""
+    correct = 0
+    start = time.time()
+    results = [] # Store results for potential later analysis if needed
+
+    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
+        # Submit all tasks, passing the MODEL_TO_USE
+        futures = {
+            pool.submit(optimize.solve, row["Problem"], MODEL_TO_USE): row["Answer"] for row in DATA
+        } # Pass MODEL_TO_USE here
+
+        try:
+            # Process completed tasks
+            for idx, future in enumerate(as_completed(futures), 1):
+                problem_answer = futures[future] # Get the corresponding ground truth answer
+                try:
+                    # Wait up to TASK_TIMEOUT seconds for each LLM call
+                    llm_raw_output = future.result(timeout=TASK_TIMEOUT)
+                    is_correct = grade_answer(llm_raw_output, str(problem_answer))
+                    if is_correct:
+                        correct += 1
+                    results.append({"raw_output": llm_raw_output, "correct_answer": problem_answer, "is_correct": is_correct})
+
+                except Exception as exc:
+                    print(f"[error] Generated an exception: {exc}")
+                    results.append({"raw_output": f"Error: {exc}", "correct_answer": problem_answer, "is_correct": False})
+
+                if idx % LOG_EVERY == 0 or idx == TOTAL_SAMPLES:
+                    elapsed = time.time() - start
+                    current_accuracy = correct / idx if idx > 0 else 0
+                    print(
+                        f"[progress] {idx}/{TOTAL_SAMPLES} completed, accuracy: {current_accuracy:.4f}, elapsed {elapsed:.1f} s",
+                        flush=True,
+                    )
+        except concurrent.futures.TimeoutError:
+            # Abort any stuck LLM calls
+            print(f"[error] LLM call timed out after {TASK_TIMEOUT}s", flush=True)
+            # Cancel all pending futures and exit
+            for f in futures:
+                f.cancel()
+            print("Exiting due to timeout", file=sys.stderr)
+            sys.exit(1)
+        except KeyboardInterrupt:
+            print("\nEvaluation interrupted by user", file=sys.stderr)
+            sys.exit(1)
+
+    # Final accuracy calculation
+    total_evaluated = len(results)
+    final_accuracy = correct / total_evaluated if total_evaluated > 0 else 0
+    return final_accuracy
+
+
+if __name__ == "__main__":
+    acc = run_evaluation()
+    # Weco parses this exact line format
+    print(f"accuracy: {acc:.4f}")
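A minimal standalone sketch of the `\boxed{}` extraction used by `extract_final_answer` above; the sample string is invented, while the regex and zero-padding mirror the code in the diff.

```python
import re

# Invented sample response; the pattern below is the one eval.py checks first.
sample = "Careful casework gives 42, so the final answer is \\boxed{42}."
boxed = re.search(r"\\boxed\{(\d{1,3})\}", sample)
print(boxed.group(1).zfill(3) if boxed else "")  # -> 042
```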
weco-0.2.9/examples/prompt/optimize.py (new file)

@@ -0,0 +1,34 @@
+# weco-cli/examples/prompt/optimize.py
+"""
+optimize.py
+This module holds the prompt template and the LLM call.
+Weco modifies this file to optimize the prompt instructions.
+The model used for the LLM call is passed in from eval.py.
+"""
+
+from openai import OpenAI
+
+client = OpenAI() # API key must be in OPENAI_API_KEY
+# MODEL constant removed from here
+
+PROMPT_TEMPLATE = """You are an expert competition mathematician tasked with solving an AIME problem.
+The final answer must be a three-digit integer between 000 and 999, inclusive.
+Please reason step-by-step towards the solution. Keep your reasoning concise.
+Conclude your response with the final answer enclosed in \\boxed{{}}. For example: The final answer is \\boxed{{042}}.
+
+Problem:
+{problem}
+
+Solution:
+"""
+
+
+def solve(problem: str, model_name: str) -> str:
+    """Return the model's raw text answer for one problem using the specified model."""
+    prompt = PROMPT_TEMPLATE.format(problem=problem)
+
+    response = client.chat.completions.create(
+        model=model_name, # Use the passed-in model name
+        messages=[{"role": "user", "content": prompt}],
+    )
+    return response.choices[0].message.content.strip()
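A hypothetical call into the new module, mirroring how `eval.py` invokes it; the problem text is made up, and `OPENAI_API_KEY` must be set for the request to succeed.

```python
import optimize  # the module added in this release

# Hypothetical problem text; eval.py passes MODEL_TO_USE ("gpt-4.1") as the second argument.
raw = optimize.solve("Find the remainder when 7**100 is divided by 1000.", "gpt-4.1")
print(raw)  # expected to end with a final answer wrapped in \boxed{...}
```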
weco-0.2.9/examples/prompt/prompt_guide.md (new file)

@@ -0,0 +1,45 @@
+# Weco Prompt Optimization Guidelines for AIME (Targeting GPT-4.1)
+
+## 1. Goal
+
+Your objective is to modify the the `optimize.py` file to improve the `accuracy` metric when solving AIME math problems. The modifications should leverage the capabilities of the target model, **GPT-4.1**.
+
+## 2. Files and Workflow
+
+* **Target File for Modification:** `optimize.py`. * **Evaluation Script:** `eval.py`. This script:
+    * Defines the actual LLM used for solving (`MODEL_TO_USE`, which is set to `gpt-4.1` in this context).
+    * Calls `optimize.solve(problem, model_name="gpt-4.1")`.
+    * Parses the output from `optimize.solve`. **Crucially, it expects the final 3-digit answer (000-999) to be enclosed in `\boxed{XXX}`.** For example: `\boxed{042}`. Your prompt modifications *must* ensure the model consistently produces this format for the final answer.
+    * Compares the extracted answer to the ground truth and prints the `accuracy:` metric, which Weco uses for guidance.
+
+## 3. Target Model: GPT-4.1
+
+You are optimizing the prompt for `gpt-4.1`. Based on its characteristics, consider the following:
+
+* **Strengths:**
+    * **Significantly Improved Instruction Following:** GPT-4.1 is better at adhering to complex instructions, formats, and constraints compared to previous models. This is key for AIME where precision is vital. It excels on hard instruction-following tasks.
+    * **Stronger Coding & Reasoning:** Its improved coding performance (e.g., SWE-bench) suggests enhanced logical reasoning capabilities applicable to mathematical problem-solving.
+    * **Refreshed Knowledge:** Knowledge cutoff is June 2024.
+* **Considerations:**
+    * **Literal Interpretation:** GPT-4.1 can be more literal. Prompts should be explicit and specific about the desired reasoning process and output format. Avoid ambiguity.
+
+## 4. Optimization Strategies (Focus on `PROMPT_TEMPLATE` in `optimize.py`)
+
+The primary goal is to enhance the model's reasoning process for these challenging math problems. Focus on Chain-of-Thought (CoT) designs within the `PROMPT_TEMPLATE`.
+
+**Ideas to Explore:**
+You don't have to implement all of them, but the following ideas might be helpful:
+* **Workflow Patterns** try to use some of the following patterns:
+    * **Linear**: Linear workflow, standarded CoT E.g. considering the following thinking steps (you don't have to include all of them), "1. Understand the problem constraints. 2. Identify relevant theorems/formulas. 3. Formulate a plan. 4. Execute calculations step-by-step. 5. Verify intermediate results. 6. State the final answer in the required format."
+    * **List Candidates**: You can ask the model to propose a few solutions in a particular step and pick the best solution. You can potentially also set the criterias in the prompt.
+    * **Code** Use pesudo code to define even more complex workflows with loops, conditional statement, or go to statement.
+* **Other CoT Techniques:**
+    * Self-Correction/Reflection
+    * Plan Generation
+    * Debate, simulating multiple characters
+    * Tree of thought
+* **Few-Shot Examples:** You *could* experiment with adding 1-2 high-quality AIME problem/solution examples directly into the `PROMPT_TEMPLATE` string (similar to how Weco attempted in one of the runs). Ensure the examples clearly show the desired reasoning style and the final `\boxed{XXX}` format.
+* **Play with format:** The way you format the prompt. Markdown, xml, json, code or natural language. Similarly for the thinking tokens themselves you can also try out different formats.
+
+## 5. Constraints
+* **Ensure the final output reliably contains `\boxed{XXX}` as the evaluation script depends on it.**
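To make the "Linear" workflow idea above concrete, one hypothetical rewrite of `PROMPT_TEMPLATE` in `optimize.py` might look like the sketch below; it is illustrative only and not part of the released files.

```python
# Hypothetical variant of PROMPT_TEMPLATE (keeps the {problem} placeholder and the
# escaped \\boxed{{}} requirement that eval.py's parser depends on).
PROMPT_TEMPLATE = """You are an expert competition mathematician solving an AIME problem.
Work through the following steps:
1. Restate the problem constraints in your own words.
2. Identify the relevant theorems or formulas.
3. Carry out the calculations step by step, verifying intermediate results.
4. State the final answer as a three-digit integer inside \\boxed{{}}, e.g. \\boxed{{042}}.

Problem:
{problem}

Solution:
"""
```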
weco-0.2.9/examples/triton/README.md (new file)

@@ -0,0 +1,38 @@
+# Example: Optimizing PyTorch Self-Attention with Triton
+
+This example demonstrates using Weco to optimize a causal multi-head self-attention mechanism, a core component of Transformer models, implemented in PyTorch. The optimization target is to leverage [Triton](https://github.com/triton-lang/triton), a language and compiler for writing highly efficient GPU code, to accelerate the operation.
+
+## Setup
+
+1. Ensure you are in the `examples/triton` directory.
+2. Install the required dependencies:
+   ```bash
+   pip install torch triton
+   ```
+   *(Note: Triton installation might require specific CUDA versions. Refer to the official Triton documentation if you encounter issues.)*
+
+## Optimization Command
+
+Run the following command to start the optimization process:
+
+```bash
+weco --source optimize.py \
+--eval-command "python evaluate.py --solution-path optimize.py" \
+--metric speedup \
+--maximize true \
+--steps 30 \
+--model gemini-2.5-pro-exp-03-25 \
+--additional-instructions "Use triton to optimize the code while ensuring a small max float diff. Maintain the same code format."
+```
+
+### Explanation
+
+* `--source optimize.py`: The PyTorch self-attention implementation to be optimized.
+* `--eval-command "python evaluate.py --solution-path optimize.py"`: Executes the evaluation script, which benchmarks the `optimize.py` code against a baseline and prints the `speedup`.
+* `--metric speedup`: The target metric for optimization.
+* `--maximize true`: Weco should maximize the speedup.
+* `--steps 30`: The number of optimization iterations.
+* `--model gemini-2.5-pro-exp-03-25`: The LLM driving the optimization.
+* `--additional-instructions "..."`: Provides specific guidance to the LLM, instructing it to use Triton, maintain numerical accuracy ("small max float diff"), and preserve the code structure.
+
+Weco will iteratively refine `optimize.py` using Triton, guided by the evaluation results and the provided instructions.
{weco-0.2.8 → weco-0.2.9}/pyproject.toml

@@ -10,7 +10,7 @@ authors = [
 ]
 description = "Documentation for `weco`, a CLI for using Weco AI's code optimizer."
 readme = "README.md"
-version = "0.2.8"
+version = "0.2.9"
 license = {text = "MIT"}
 requires-python = ">=3.8"
 dependencies = ["requests", "rich"]
{weco-0.2.8 → weco-0.2.9}/weco/cli.py

@@ -57,6 +57,11 @@ def main() -> None:
         type=str,
         help="Description of additional instruction or path to a file containing additional instructions",
     )
+    parser.add_argument(
+        "--preserve-source",
+        action="store_true",
+        help="If set, do not overwrite the original source file; only save modified versions in the runs directory",
+    )
     args = parser.parse_args()
 
     try:

@@ -73,15 +78,16 @@ def main() -> None:
         "debug_prob": 0.5,
         "max_debug_depth": max(1, math.ceil(0.1 * steps)), # 10% of steps
     }
+    # Read API keys
+    api_keys = read_api_keys_from_env()
+    # API request timeout
+    timeout = 800
+
     # Read additional instructions
     additional_instructions = read_additional_instructions(additional_instructions=args.additional_instructions)
     # Read source code
     source_fp = pathlib.Path(args.source)
     source_code = read_from_path(fp=source_fp, is_json=False)
-    # Read API keys
-    api_keys = read_api_keys_from_env()
-    # API request timeout
-    timeout = 800
 
     # Initialize panels
     summary_panel = SummaryPanel(

@@ -124,7 +130,8 @@ def main() -> None:
 
     # Write the code string to the source file path
    # Do this after the original code is saved
-    write_to_path(fp=source_fp, content=session_response["code"])
+    if not args.preserve_source:
+        write_to_path(fp=source_fp, content=session_response["code"])
 
     # Update the panels with the initial solution
     # Add session id now that we have it

@@ -191,20 +198,25 @@ def main() -> None:
     )
 
     for step in range(1, steps):
+        # Re-read instructions from the original source (file path or string) BEFORE each suggest call
+        current_additional_instructions = read_additional_instructions(
+            additional_instructions=args.additional_instructions
+        )
         # Evaluate the current output and get the next solution
         eval_and_next_solution_response = evaluate_feedback_then_suggest_next_solution(
             console=console,
            session_id=session_id,
             execution_output=term_out,
-            additional_instructions=additional_instructions,
+            additional_instructions=current_additional_instructions,
             api_keys=api_keys,
             timeout=timeout,
         )
         # Save next solution (.runs/<session-id>/step_<step>.<extension>)
-        write_to_path(fp=runs_dir / f"step_{step}
+        write_to_path(fp=runs_dir / f"step_{step}{source_fp.suffix}", content=eval_and_next_solution_response["code"])
 
         # Write the next solution to the source file
-        write_to_path(fp=source_fp, content=eval_and_next_solution_response["code"])
+        if not args.preserve_source:
+            write_to_path(fp=source_fp, content=eval_and_next_solution_response["code"])
 
         # Get the optimization session status for
         # the best solution, its score, and the history to plot the tree

@@ -283,12 +295,16 @@ def main() -> None:
         transition_delay=0.1, # Slightly longer delay for evaluation results
     )
 
+    # Re-read instructions before the final feedback step
+    current_additional_instructions = read_additional_instructions(
+        additional_instructions=args.additional_instructions
+    )
     # Ensure we pass evaluation results for the last step's generated solution
     eval_and_next_solution_response = evaluate_feedback_then_suggest_next_solution(
         console=console,
         session_id=session_id,
         execution_output=term_out,
-        additional_instructions=additional_instructions,
+        additional_instructions=current_additional_instructions,
         api_keys=api_keys,
         timeout=timeout,
     )

@@ -355,7 +371,8 @@ def main() -> None:
     write_to_path(fp=runs_dir / f"best.{source_fp.suffix}", content=best_solution_content)
 
     # write the best solution to the source file
-    write_to_path(fp=source_fp, content=best_solution_content)
+    if not args.preserve_source:
+        write_to_path(fp=source_fp, content=best_solution_content)
 
     console.print(end_optimization_layout)
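Taken together with the README table above, the new flag can be exercised as in the sketch below; the file names, step count, and model are placeholders rather than values from this diff.

```bash
# With --preserve-source, the original file is left untouched; per-step candidates and the
# best solution are still written under the log directory (.runs/<session-id>/ by default).
weco --source optimize.py \
     --eval-command "python evaluate.py" \
     --metric speedup \
     --maximize true \
     --steps 10 \
     --model gpt-4o \
     --preserve-source
```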
{weco-0.2.8 → weco-0.2.9}/weco.egg-info/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: weco
-Version: 0.2.8
+Version: 0.2.9
 Summary: Documentation for `weco`, a CLI for using Weco AI's code optimizer.
 Author-email: Weco AI Team <contact@weco.ai>
 License: MIT

@@ -76,7 +76,7 @@ The `weco` CLI leverages a tree search approach guided by Large Language Models
 
 This basic example shows how to optimize a simple PyTorch function for speedup.
 
-For more advanced examples, including **[Metal/MLX](/examples/metal/README.md), [Triton](/examples/triton/README.md), [CUDA kernel optimization](/examples/cuda/README.md)**, and **[ML model optimization](/examples/spaceship-titanic/README.md)
+For more advanced examples, including **[Metal/MLX](/examples/metal/README.md), [Triton](/examples/triton/README.md), [CUDA kernel optimization](/examples/cuda/README.md)**, and **[ML model optimization](/examples/spaceship-titanic/README.md)**, please see the `README.md` files within the corresponding subdirectories under the [`examples/`](./examples/) folder.
 
 ```bash
 # Navigate to the example directory

@@ -108,9 +108,10 @@ weco --source optimize.py \
 | `--metric` | The name of the metric you want to optimize (e.g., 'accuracy', 'speedup', 'loss'). This metric name should match what's printed by your `--eval-command`. | Yes |
 | `--maximize` | Whether to maximize (`true`) or minimize (`false`) the metric. | Yes |
 | `--steps` | Number of optimization steps (LLM iterations) to run. | Yes |
-| `--model` | Model identifier for the LLM to use (e.g., `gpt-4o`, `claude-3.
+| `--model` | Model identifier for the LLM to use (e.g., `gpt-4o`, `claude-3.7-sonnet`). Recommended models to try include `o4-mini`, and `gemini-2.5-pro-exp-03-25`.| Yes |
 | `--additional-instructions` | (Optional) Natural language description of specific instructions OR path to a file containing detailed instructions to guide the LLM. | No |
 | `--log-dir` | (Optional) Path to the directory to log intermediate steps and final optimization result. Defaults to `.runs/`. | No |
+| `--preserve-source` | (Optional) If set, do not overwrite the original `--source` file. Modifications and the best solution will still be saved in the `--log-dir`. | No |
 
 ---
 
{weco-0.2.8 → weco-0.2.9}/weco.egg-info/SOURCES.txt

@@ -14,6 +14,10 @@ examples/metal/README.md
 examples/metal/evaluate.py
 examples/metal/examples.rst
 examples/metal/optimize.py
+examples/prompt/README.md
+examples/prompt/eval.py
+examples/prompt/optimize.py
+examples/prompt/prompt_guide.md
 examples/spaceship-titanic/README.md
 examples/spaceship-titanic/baseline.py
 examples/spaceship-titanic/evaluate.py