swegen 0.1.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- swegen/__init__.py +14 -0
- swegen/analyze/__init__.py +24 -0
- swegen/analyze/classifier.py +637 -0
- swegen/analyze/classify_prompt.txt +241 -0
- swegen/analyze/models.py +253 -0
- swegen/analyze/run.py +656 -0
- swegen/analyze/verdict_prompt.txt +126 -0
- swegen/cli.py +411 -0
- swegen/config.py +142 -0
- swegen/create/__init__.py +22 -0
- swegen/create/claude_code_runner.py +988 -0
- swegen/create/claude_code_utils.py +95 -0
- swegen/create/create.py +706 -0
- swegen/create/diff_utils.py +142 -0
- swegen/create/orchestrator.py +368 -0
- swegen/create/pr_fetcher.py +187 -0
- swegen/create/repo_cache.py +175 -0
- swegen/create/task_instruction.py +363 -0
- swegen/create/task_reference.py +130 -0
- swegen/create/task_skeleton.py +266 -0
- swegen/create/utils.py +350 -0
- swegen/farm/__init__.py +13 -0
- swegen/farm/farm_hand.py +342 -0
- swegen/farm/fetcher.py +341 -0
- swegen/farm/state.py +231 -0
- swegen/farm/stream_farm.py +430 -0
- swegen/tools/__init__.py +16 -0
- swegen/tools/harbor_runner.py +191 -0
- swegen/tools/validate.py +523 -0
- swegen/tools/validate_utils.py +142 -0
- swegen-0.1.0.dist-info/METADATA +292 -0
- swegen-0.1.0.dist-info/RECORD +35 -0
- swegen-0.1.0.dist-info/WHEEL +4 -0
- swegen-0.1.0.dist-info/entry_points.txt +3 -0
- swegen-0.1.0.dist-info/licenses/LICENSE +201 -0
swegen/analyze/verdict_prompt.txt
ADDED

@@ -0,0 +1,126 @@

You are synthesizing multiple independent trial analyses into a final task quality verdict.

## Critical Calibration: Hard Tasks are GOOD

**Hard tasks with low pass rates are DESIRABLE for benchmarks.** A 20-40% pass rate indicates a well-calibrated task. Do NOT penalize tasks for being difficult.

- 0% pass rate: Could be too hard OR could be a task problem (investigate)
- 10-40% pass rate: IDEAL for benchmark tasks - challenging but solvable
- 50-70% pass rate: Good but possibly too easy
- 80-100% pass rate: Too easy for benchmarking (unless intended)

**Failures are EXPECTED.** The default classification for failures should be GOOD_FAILURE (agent's fault) unless there's strong evidence the task itself is broken.

## Context

You have {num_trials} trial runs of the same Harbor task. Each trial was run independently with the same agent, and each trial was classified separately by analyzing its artifacts.

**Baseline Validation:**
{baseline_summary}

**Static Quality Check:**
{quality_check_summary}

## Individual Trial Classifications

{trial_classifications}

## Your Goal

Synthesize these independent analyses into a **final verdict** on the task quality.

Consider:
1. **Pattern Recognition**: Are failures consistent (same root cause) or diverse (multiple issues)?
2. **Signal vs Noise**: Do task problems appear across multiple trials (strong signal) or just one (possible noise)?
3. **Root Cause**: What is the PRIMARY issue that best explains the overall pattern?
4. **Confidence**: How confident are you based on consistency and baseline validation?
5. **Actionability**: What are the most important fixes needed (if any)?
6. **Solvability**: Did ANY trial succeed? If yes, the task IS solvable - failures are agent variance.
7. **Over-classification check**: Does "Underspecified Instruction" just mean "the task is hard"? That's not a problem.

## Classification Categories

**Task problems** (need fixing):
- BAD_FAILURE: Task specification issues (underspecified instruction, brittle tests, ambiguous requirements, etc.)
- BAD_SUCCESS: Cheating/gaming (hardcoding, test inspection, oracle copying, tests too permissive)

**Normal outcomes** (task is fine):
- GOOD_SUCCESS: Agent solved it legitimately
- GOOD_FAILURE: Agent couldn't solve it due to agent limitations (timeout, wrong approach, complexity, etc.)
- HARNESS_ERROR: Infrastructure issues (not task's fault)

## Confidence Levels

- **high**: All trials agree OR baseline validation critical failure OR clear consistent pattern
- **medium**: Majority of trials agree (>50%) OR mixed signals but leaning one way OR no successes but failures look legitimate
- **low**: Trials disagree OR unclear pattern OR only harness errors

## Decision Logic

**Default assumption: Task is GOOD.** Only mark as bad with strong evidence.

1. **Critical baseline failures** → is_good=false, high confidence
   - If nop passed: "task may be pre-solved"
   - If oracle failed: "reference solution doesn't work"

2. **Consistent STRONG task problems** → is_good=false
   - Majority (>50%) show the SAME BAD_FAILURE subtype with clear evidence → medium confidence
   - All trials show task problems AND evidence is compelling → high confidence
   - NOTE: "Underspecified Instruction" alone is often OVER-diagnosed. Scrutinize this carefully.

3. **Some successes, mixed failures** → is_good=true, medium-high confidence
   - Having ANY success means the task IS solvable
   - Mixed failures are normal for hard tasks

4. **All failures but mostly GOOD_FAILURE** → is_good=true, medium confidence
   - Task is hard but not broken
   - This is the IDEAL outcome for a challenging benchmark task

5. **Mix of GOOD_FAILURE and scattered BAD_FAILURE** → is_good=true, medium confidence
   - Isolated BAD_FAILURE classifications may be noise/over-classification
   - Default to trusting the task unless the pattern is clear

6. **Any successes** → is_good=true, medium-high confidence
   - Task is clearly solvable

## Primary Issue Format

If is_good=false, write a clear primary issue:
- For a consistent problem: "{{count}}/{{total}} trials show: {{most_common_subtype}} - {{brief_explanation}}"
- For diverse problems: "{{count}}/{{total}} trials show task issues: {{list_of_problems}}"
- For baseline: "CRITICAL: {{baseline_issue}}"

If is_good=true:
- null (or omit)

## Recommendations

For BAD tasks, provide 3-5 **specific, actionable** recommendations:
- Prioritize issues that appeared in multiple trials
- Focus on fixes that address the root cause
- Be concrete (e.g., "Add file path to instruction.md line 15" not "improve instructions")
- Deduplicate similar recommendations across trials

For GOOD tasks, recommendations should be empty or minor suggestions.

## Output Format

Return ONLY valid JSON (no markdown, no code blocks, no explanation):

{{
  "is_good": true/false,
  "confidence": "high" | "medium" | "low",
  "primary_issue": "clear description of main problem" or null,
  "recommendations": ["specific fix 1", "specific fix 2", ...],
  "reasoning": "1-2 sentence explanation of your verdict based on the trial pattern"
}}

## Important Notes

- **Default to GOOD**: Tasks that passed baseline validation are presumed good. Failures are expected.
- **Consistency matters**: 1/4 trials showing an issue is noise; 3/4 with the SAME issue is a signal
- **Baseline validation overrides**: Critical baseline failures always mean is_good=false
- **Be skeptical of BAD_FAILURE**: "Underspecified Instruction" is often over-diagnosed. Ask: could the agent have figured this out by exploring the codebase?
- **Low pass rates are GOOD**: a 25% pass rate on a hard task is ideal for benchmarking
- **Successes prove solvability**: Any success means the task CAN be solved with the given instruction
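The output format above is a strict JSON contract. A minimal sketch of how a consumer might parse and sanity-check such a payload (hypothetical helper names; the package's actual parsing code is not shown in this diff):

```python
import json

ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def parse_verdict(raw: str) -> dict:
    """Parse a verdict payload and check the fields the prompt requires."""
    verdict = json.loads(raw)
    assert isinstance(verdict["is_good"], bool)
    assert verdict["confidence"] in ALLOWED_CONFIDENCE
    # primary_issue is null (or omitted) when the task is good
    assert verdict.get("primary_issue") is None or isinstance(verdict["primary_issue"], str)
    assert isinstance(verdict["recommendations"], list)
    assert isinstance(verdict["reasoning"], str)
    return verdict


example = (
    '{"is_good": true, "confidence": "medium", "primary_issue": null,'
    ' "recommendations": [],'
    ' "reasoning": "2/3 trials succeeded; failures look like agent variance."}'
)
print(parse_verdict(example)["confidence"])  # medium
```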
swegen/cli.py
ADDED
@@ -0,0 +1,411 @@

from __future__ import annotations

from importlib.metadata import PackageNotFoundError as _PkgNotFound
from importlib.metadata import version as _pkg_version
from pathlib import Path

import typer
import json
from dotenv import load_dotenv
from harbor.models.environment_type import EnvironmentType
from rich.console import Console

from swegen.config import CreateConfig, FarmConfig
from swegen.create import MissingIssueError, TrivialPRError
from swegen.create.create import run_reversal
from swegen.farm import StreamFarmer
from swegen.analyze import AnalyzeArgs, run_analyze, TrialClassifier, write_trial_analysis_files
from swegen.tools.validate import ValidateArgs, run_validate
from swegen.tools.validate_utils import ValidationError

load_dotenv()

app = typer.Typer(no_args_is_help=True, add_completion=False, help="Task generation CLI")


@app.callback(invoke_without_command=True)
def _root(
    version: bool = typer.Option(
        False,
        "--version",
        "-V",
        help="Show swegen version and exit",
        is_eager=True,
    ),
) -> None:
    if version:
        try:
            typer.echo(f"swegen {_pkg_version('swe-gen')}")
        except _PkgNotFound:
            typer.echo("swegen (version unknown)")
        raise typer.Exit()


create_app = typer.Typer(
    no_args_is_help=True,
    invoke_without_command=True,
    add_completion=False,
    help="Create a Harbor task from a merged PR and validate",
)


@create_app.callback()
def create_cmd(
    repo: str = typer.Option(..., help="GitHub repository (owner/repo or URL)"),
    pr: int = typer.Option(..., help="PR number"),
    output: Path = typer.Option(Path("tasks"), help="Output root", show_default=True),
    cc_timeout: int = typer.Option(
        3200, help="Timeout for CC session in seconds (~53 min default)", show_default=True
    ),
    validate: bool = typer.Option(
        True, help="Run Harbor validations; --no-validate skips validation"
    ),
    force: bool = typer.Option(False, help="Bypass local dedupe and regenerate"),
    state_dir: Path = typer.Option(
        Path(".state"), help="Local dedupe state dir", show_default=True
    ),
    no_cache: bool = typer.Option(
        False, "--no-cache", help="Disable reusing cached Dockerfiles/test.sh from previous tasks"
    ),
    require_minimum_difficulty: bool = typer.Option(
        True,
        help="Require minimum difficulty (3+ source files); --no-require-minimum-difficulty to skip this check",
    ),
    min_source_files: int = typer.Option(
        3, help="Minimum number of source files required (tests excluded)", show_default=True
    ),
    max_source_files: int = typer.Option(
        10,
        help="Maximum number of source files to avoid large refactors (tests excluded)",
        show_default=True,
    ),
    require_issue: bool = typer.Option(
        True,
        help="Require PR to have a linked issue (higher quality instructions); --no-require-issue uses PR body/title instead",
    ),
    allow_unmerged: bool = typer.Option(
        False,
        help="Allow processing unmerged PRs (for testing/preview); --allow-unmerged to enable",
    ),
    environment: str = typer.Option(
        "docker",
        "-e",
        "--env",
        help="Environment type for Harbor runs (docker|daytona|e2b|modal|runloop|gke)",
        show_default=True,
    ),
    verbose: bool = typer.Option(False, "-v", "--verbose", help="Increase output verbosity"),
    quiet: bool = typer.Option(False, "-q", "--quiet", help="Reduce output verbosity"),
) -> None:
    config = CreateConfig(
        repo=repo,
        pr=pr,
        output=output,
        cc_timeout=cc_timeout,
        validate=validate,
        force=force,
        state_dir=state_dir,
        use_cache=not no_cache,
        require_minimum_difficulty=require_minimum_difficulty,
        min_source_files=min_source_files,
        max_source_files=max_source_files,
        require_issue=require_issue,
        allow_unmerged=allow_unmerged,
        environment=EnvironmentType(environment),
        verbose=verbose,
        quiet=quiet,
    )
    try:
        run_reversal(config)
    except (TrivialPRError, MissingIssueError, ValidationError, FileExistsError) as err:
        # These exceptions have already displayed user-friendly messages
        # Exit with error code but don't show traceback
        raise SystemExit(1) from err


app.add_typer(create_app, name="create")


@app.command(help="Validate an existing Harbor task by running NOP and ORACLE")
def validate(
    path: Path = typer.Argument(
        ...,
        help="Path to Harbor dataset root, specific task directory, or task ID when used with dataset root",
    ),
    task: str | None = typer.Option(None, "--task", "-t", help="Task ID when --path points to dataset root"),
    agent: str = typer.Option("both", help="Agent to run: both|nop|oracle", show_default=True),
    jobs_dir: Path = typer.Option(
        Path(".state/harbor-jobs"),
        help="Directory to store Harbor job artifacts",
        show_default=True,
    ),
    timeout_multiplier: float | None = typer.Option(None, help="Multiply default timeouts (e.g., 3.0)"),
    environment: str = typer.Option(
        "docker",
        "-e",
        "--env",
        help="Environment type for Harbor runs (docker|daytona|e2b|modal|runloop|gke)",
        show_default=True,
    ),
    verbose: bool = typer.Option(False, "-v", "--verbose", help="Increase output verbosity"),
    quiet: bool = typer.Option(False, "-q", "--quiet", help="Reduce output verbosity"),
    max_parallel: int = typer.Option(
        8, help="Maximum number of parallel validations (batch mode only)", show_default=True
    ),
    show_passed: bool = typer.Option(
        False,
        "--show-passed",
        help="Show passed tasks in output (batch mode: default shows only failures)",
    ),
    output: Path | None = typer.Option(
        None, "-o", "--output", help="Write results to file as they complete (batch mode only)"
    ),
    docker_prune_batch: int = typer.Option(
        5,
        help="Run docker cleanup after every N tasks (0 to disable, local docker only)",
        show_default=True,
    ),
) -> None:
    if agent not in ("both", "nop", "oracle"):
        raise typer.BadParameter("agent must be one of: both, nop, oracle")
    run_validate(
        ValidateArgs(
            path=path,
            task=task,
            jobs_dir=jobs_dir,
            agent=agent,
            timeout_multiplier=timeout_multiplier,
            verbose=verbose,
            quiet=quiet,
            environment=EnvironmentType(environment),
            max_parallel=max_parallel,
            show_passed=show_passed,
            output_file=output,
            docker_prune_batch=docker_prune_batch,
        )
    )


@app.command(help="Analyze a task by running agent trials and classifying outcomes")
def analyze(
    path: Path = typer.Argument(..., help="Path to the task directory to analyze"),
    agent: str = typer.Option(
        "claude-code", "-a", "--agent", help="Agent to run trials with", show_default=True
    ),
    model: str = typer.Option(
        "anthropic/claude-sonnet-4-5",
        "-m",
        "--model",
        help="Model to use for agent trials",
        show_default=True,
    ),
    n_trials: int = typer.Option(
        3, "-k", "--n-trials", help="Number of trials to run", show_default=True
    ),
    n_concurrent: int = typer.Option(
        3,
        "-n",
        "--n-concurrent",
        help="Number of concurrent trials (1=sequential, 3-5 recommended)",
        show_default=True,
    ),
    jobs_dir: Path = typer.Option(
        Path(".state/analyze-jobs"),
        "--jobs-dir",
        help="Directory to store job artifacts",
        show_default=True,
    ),
    skip_quality_check: bool = typer.Option(
        False, "--skip-quality-check", help="Skip static quality check"
    ),
    skip_baseline: bool = typer.Option(
        False, "--skip-baseline", help="Skip baseline validation (nop/oracle)"
    ),
    skip_classify: bool = typer.Option(
        False, "--skip-classify", help="Skip LLM classification of trial outcomes"
    ),
    analysis_model: str = typer.Option(
        "claude-sonnet-4-5",
        "--analysis-model",
        help="Model for Claude Code classification",
        show_default=True,
    ),
    timeout_multiplier: float = typer.Option(
        1.0, "--timeout-multiplier", help="Multiply default timeouts", show_default=True
    ),
    environment: str = typer.Option(
        "docker",
        "-e",
        "--env",
        help="Environment type for Harbor runs (docker|daytona|e2b|modal|runloop|gke)",
        show_default=True,
    ),
    verbose: bool = typer.Option(False, "-v", "--verbose", help="Increase output verbosity"),
    classification_timeout: int = typer.Option(
        300,
        "--classification-timeout",
        help="Timeout per trial classification in seconds",
        show_default=True,
    ),
    verdict_timeout: int = typer.Option(
        180,
        "--verdict-timeout",
        help="Timeout for verdict synthesis in seconds",
        show_default=True,
    ),
    save_to_dir: bool = typer.Option(
        False,
        "--save-to-dir",
        help="Write trajectory-analysis.{md,json} to each trial directory",
    ),
) -> None:
    """
    Analyze a Harbor task to determine if it's well-specified.

    This command classifies trial outcomes to identify TASK PROBLEMS vs AGENT PROBLEMS:

    1. Static quality check (Harbor's tasks check)
    2. Baseline validation (nop should fail, oracle should pass)
    3. Run N agent trials (default: 3 with Claude Code)
    4. Classify each trial outcome:
       - GOOD_SUCCESS: Agent solved it correctly
       - BAD_SUCCESS: Agent cheated or tests too permissive
       - GOOD_FAILURE: Agent failed due to its own limitations
       - BAD_FAILURE: Agent failed due to task issues
       - HARNESS_ERROR: Infrastructure problem
    5. Compute task verdict with recommendations

    The goal is to identify tasks that need fixing before release.

    Flags match Harbor CLI conventions:
      -k / --n-trials: Total number of trials to run
      -n / --n-concurrent: Number of trials to run concurrently (parallelism)

    Examples:
      # Sequential (default)
      swegen analyze tasks/my-task -k 5

      # Parallel (3 trials at once)
      swegen analyze tasks/my-task -k 10 -n 3
    """
    run_analyze(
        AnalyzeArgs(
            task_path=path,
            agent=agent,
            model=model,
            n_trials=n_trials,
            n_concurrent=n_concurrent,
            jobs_dir=jobs_dir,
            skip_quality_check=skip_quality_check,
            skip_baseline=skip_baseline,
            skip_classify=skip_classify,
            analysis_model=analysis_model,
            environment=environment,
            timeout_multiplier=timeout_multiplier,
            verbose=verbose,
            classification_timeout=classification_timeout,
            verdict_timeout=verdict_timeout,
            save_to_dir=save_to_dir,
        )
    )


@app.command(help="Continuous PR farming - stream through entire PR history")
def farm(
    repo: str = typer.Argument(
        ..., help="GitHub repository in owner/name format (e.g., fastapi/fastapi)"
    ),
    output: Path = typer.Option(
        Path("tasks"), help="Output directory for generated tasks", show_default=True
    ),
    state_dir: Path = typer.Option(
        Path(".state"), help="State directory for cache/logs", show_default=True
    ),
    force: bool = typer.Option(True, help="Regenerate even if task already exists"),
    timeout: int = typer.Option(300, help="Timeout per PR in seconds", show_default=True),
    cc_timeout: int = typer.Option(
        3200, help="Timeout for Claude Code session in seconds (~53 min default)", show_default=True
    ),
    api_delay: float = typer.Option(
        0.5, help="Delay between GitHub API calls in seconds", show_default=True
    ),
    task_delay: int = typer.Option(60, help="Delay between tasks in seconds", show_default=True),
    reset: bool = typer.Option(False, "--reset", help="Reset state and start from beginning"),
    resume_from: str | None = typer.Option(
        None, help="Resume from date (e.g., '2024-01-15' or '2024-01-15T10:30:00Z')"
    ),
    dry_run: bool = typer.Option(
        False, "--dry-run", help="Only show what would run (no task generation)"
    ),
    docker_prune_batch: int = typer.Option(
        5, help="Run docker cleanup after every N PRs (0 to disable)", show_default=True
    ),
    skip_list: str | None = typer.Option(None, help="Path to file with task IDs to skip (one per line)"),
    no_cache: bool = typer.Option(
        False, "--no-cache", help="Disable reusing cached Dockerfiles/test.sh"
    ),
    require_minimum_difficulty: bool = typer.Option(
        True,
        help="Require minimum difficulty (3+ source files); --no-require-minimum-difficulty to skip this check",
    ),
    min_source_files: int = typer.Option(
        3, help="Minimum number of source files required (tests excluded)", show_default=True
    ),
    max_source_files: int = typer.Option(
        10,
        help="Maximum number of source files to avoid large refactors (tests excluded)",
        show_default=True,
    ),
    environment: str = typer.Option(
        "docker",
        "-e",
        "--env",
        help="Environment type for Harbor runs (docker|daytona|e2b|modal|runloop|gke)",
        show_default=True,
    ),
    verbose: bool = typer.Option(False, "-v", "--verbose", help="Enable verbose output"),
    issue_only: bool = typer.Option(
        True,
        "--issue-only",
        help="Only process PRs with linked issues (higher quality instructions)",
    ),
    validate: bool = typer.Option(
        True, help="Run Harbor validation after CC; --no-validate to skip"
    ),
) -> None:
    """
    Continuously process merged GitHub PRs and convert them to Harbor tasks.

    Streams PRs page-by-page, processes them immediately, and maintains state for resumable operation.
    Uses a language-agnostic pipeline that works for any repository.
    """
    config = FarmConfig(
        repo=repo,
        output=output,
        state_dir=state_dir,
        force=force,
        timeout=timeout,
        cc_timeout=cc_timeout,
        api_delay=api_delay,
        task_delay=task_delay,
        reset=reset,
        resume_from=resume_from,
        dry_run=dry_run,
        docker_prune_batch=docker_prune_batch,
        skip_list=skip_list,
        no_cache=no_cache,
        require_minimum_difficulty=require_minimum_difficulty,
        min_source_files=min_source_files,
        max_source_files=max_source_files,
        environment=EnvironmentType(environment),
        verbose=verbose,
        issue_only=issue_only,
        validate=validate,
    )

    console = Console()
    farmer = StreamFarmer(config.repo, config, console)
    exit_code = farmer.run()
    raise typer.Exit(code=exit_code)
swegen/config.py
ADDED
@@ -0,0 +1,142 @@

from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Literal

from harbor.models.environment_type import EnvironmentType


@dataclass(frozen=True)
class CreateConfig:
    """Configuration for the create command (PR → Harbor task).

    The create command uses a language-agnostic pipeline that works
    for any repository. Claude Code analyzes the repo to detect language, runtime,
    build system, and test framework automatically.

    Attributes:
        repo: GitHub repository in "owner/repo" format or full URL
        pr: Pull request number
        output: Output directory for generated tasks (default: tasks/)
        cc_timeout: Timeout for Claude Code session in seconds
        validate: Run Harbor validations (NOP + Oracle)
        force: Bypass local dedupe and regenerate existing tasks
        state_dir: Directory for local state/cache
        use_cache: Reuse cached Dockerfiles/test.sh from previous tasks
        require_minimum_difficulty: Require 3+ source files for task
        min_source_files: Minimum number of source files required (default: 3)
        max_source_files: Maximum number of source files allowed to avoid large refactors (default: 10)
        require_issue: Require PR to have a linked issue (higher quality instructions)
        allow_unmerged: Allow processing unmerged PRs (for testing/preview, default: False)
        environment: Environment type for Harbor runs (docker, daytona, e2b, modal, runloop, gke)
        verbose: Increase output verbosity
        quiet: Reduce output verbosity
    """

    repo: str
    pr: int
    output: Path = field(default_factory=lambda: Path("tasks"))
    cc_timeout: int = 3200
    validate: bool = True
    force: bool = False
    state_dir: Path = field(default_factory=lambda: Path(".state"))
    use_cache: bool = True
    require_minimum_difficulty: bool = True
    min_source_files: int = 3
    max_source_files: int = 10
    require_issue: bool = True
    allow_unmerged: bool = False
    environment: EnvironmentType = EnvironmentType.DOCKER
    verbose: bool = False
    quiet: bool = False

    # Computed property for backward compatibility with old code
    @property
    def no_validate(self) -> bool:
        """Inverse of validate for backward compatibility."""
        return not self.validate


@dataclass(frozen=True)
class FarmConfig:
    """Configuration for the farm command (continuous PR processing).

    The farm command uses a language-agnostic pipeline that works
    for any repository. Claude Code analyzes the repo to detect language, runtime,
    build system, and test framework automatically.

    Attributes:
        repo: GitHub repository in "owner/repo" format
        output: Output directory for generated tasks (default: tasks/)
        state_dir: Directory for local state/cache
        force: Regenerate even if task already exists
        timeout: Timeout per PR in seconds
        cc_timeout: Timeout for Claude Code session in seconds
        api_delay: Delay between GitHub API calls in seconds
        task_delay: Delay between tasks in seconds
        reset: Reset state and start from beginning
        resume_from: Resume from date (ISO format or YYYY-MM-DD)
        dry_run: Only show what would run (no task generation)
        docker_prune_batch: Run docker cleanup after every N PRs (0 to disable)
        skip_list: Path to file with task IDs to skip
        no_cache: Disable reusing cached Dockerfiles/test.sh
        require_minimum_difficulty: Require 3+ source files for task
        min_source_files: Minimum number of source files required (default: 3)
        max_source_files: Maximum number of source files allowed to avoid large refactors (default: 10)
        environment: Environment type for Harbor runs (docker, daytona, e2b, modal, runloop, gke)
        verbose: Enable verbose output
        issue_only: Only process PRs that have linked issues (higher quality instructions)
        validate: Run Harbor validation after CC (useful when CC times out but task may be valid)
    """

    repo: str
    output: Path = field(default_factory=lambda: Path("tasks"))
    state_dir: Path = field(default_factory=lambda: Path(".state"))
    force: bool = True
    timeout: int = 300
    cc_timeout: int = 900
    api_delay: float = 0.5
    task_delay: int = 60
    reset: bool = False
    resume_from: str | None = None
    dry_run: bool = False
    docker_prune_batch: int = 5
    skip_list: str | None = None
    no_cache: bool = False
    require_minimum_difficulty: bool = True
    min_source_files: int = 3
    max_source_files: int = 10
    environment: EnvironmentType = EnvironmentType.DOCKER
    verbose: bool = False
    issue_only: bool = False
    validate: bool = True


@dataclass(frozen=True)
class ValidateConfig:
    """Configuration for the validate command.

    Attributes:
        path: Path to Harbor dataset root or specific task directory
        task: Task ID when path points to dataset root
        agent: Agent to run: both, nop, or oracle
        jobs_dir: Directory to store Harbor job artifacts
        timeout_multiplier: Multiply default timeouts
        environment: Environment type for Harbor runs (docker, daytona, e2b, modal, runloop, gke)
        verbose: Increase output verbosity
        quiet: Reduce output verbosity
        max_parallel: Maximum number of parallel validations (batch mode)
        show_passed: Show passed tasks in output (batch mode)
    """

    path: Path
    task: str | None = None
    agent: Literal["both", "nop", "oracle"] = "both"
    jobs_dir: Path = field(default_factory=lambda: Path(".state/harbor-jobs"))
    timeout_multiplier: float | None = None
    environment: EnvironmentType = EnvironmentType.DOCKER
    verbose: bool = False
    quiet: bool = False
    max_parallel: int = 8
    show_passed: bool = False