swegen-0.1.0-py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,126 @@
+ You are synthesizing multiple independent trial analyses into a final task quality verdict.
+
+ ## Critical Calibration: Hard Tasks are GOOD
+
+ **Hard tasks with low pass rates are DESIRABLE for benchmarks.** A 20-40% pass rate indicates a well-calibrated task. Do NOT penalize tasks for being difficult.
+
+ - 0% pass rate: Could be too hard OR could be a task problem (investigate)
+ - 10-40% pass rate: IDEAL for benchmark tasks - challenging but solvable
+ - 50-70% pass rate: Good but possibly too easy
+ - 80-100% pass rate: Too easy for benchmarking (unless intended)
+
+ **Failures are EXPECTED.** The default classification for failures should be GOOD_FAILURE (agent's fault) unless there's strong evidence the task itself is broken.
+
+ ## Context
+
+ You have {num_trials} trial runs of the same Harbor task. Each trial was run independently with the same agent, and each trial was classified separately by analyzing its artifacts.
+
+ **Baseline Validation:**
+ {baseline_summary}
+
+ **Static Quality Check:**
+ {quality_check_summary}
+
+ ## Individual Trial Classifications
+
+ {trial_classifications}
+
+ ## Your Goal
+
+ Synthesize these independent analyses into a **final verdict** on the task quality.
+
+ Consider:
+ 1. **Pattern Recognition**: Are failures consistent (same root cause) or diverse (multiple issues)?
+ 2. **Signal vs Noise**: Do task problems appear across multiple trials (strong signal) or just one (possible noise)?
+ 3. **Root Cause**: What is the PRIMARY issue that best explains the overall pattern?
+ 4. **Confidence**: How confident are you based on consistency and baseline validation?
+ 5. **Actionability**: What are the most important fixes needed (if any)?
+ 6. **Solvability**: Did ANY trial succeed? If yes, the task IS solvable - failures are agent variance.
+ 7. **Over-classification check**: Is "Underspecified Instruction" being used to mean "the task is hard"? Difficulty alone is not a task problem.
+
+ ## Classification Categories
+
+ **Task problems** (need fixing):
+ - BAD_FAILURE: Task specification issues (underspecified instruction, brittle tests, ambiguous requirements, etc.)
+ - BAD_SUCCESS: Cheating/gaming (hardcoding, test inspection, oracle copying, tests too permissive)
+
+ **Normal outcomes** (task is fine):
+ - GOOD_SUCCESS: Agent solved it legitimately
+ - GOOD_FAILURE: Agent couldn't solve it due to agent limitations (timeout, wrong approach, complexity, etc.)
+ - HARNESS_ERROR: Infrastructure issues (not the task's fault)
+
+ ## Confidence Levels
+
+ - **high**: All trials agree OR baseline validation critical failure OR clear consistent pattern
+ - **medium**: Majority of trials agree (>50%) OR mixed signals but leaning one way OR no successes but failures look legitimate
+ - **low**: Trials disagree OR unclear pattern OR only harness errors
+
+ ## Decision Logic
+
+ **Default assumption: the task is GOOD.** Only mark it as bad with strong evidence.
+
+ 1. **Critical baseline failures** → is_good=false, high confidence
+    - If nop passed: "task may be pre-solved"
+    - If oracle failed: "reference solution doesn't work"
+
+ 2. **Consistent STRONG task problems** → is_good=false
+    - Majority (>50%) show the SAME BAD_FAILURE subtype with clear evidence → medium confidence
+    - All trials show task problems AND evidence is compelling → high confidence
+    - NOTE: "Underspecified Instruction" alone is often OVER-diagnosed. Scrutinize this carefully.
+
+ 3. **Some successes, mixed failures** → is_good=true, medium-high confidence
+    - Having ANY success means the task IS solvable
+    - Mixed failures are normal for hard tasks
+
+ 4. **All failures but mostly GOOD_FAILURE** → is_good=true, medium confidence
+    - Task is hard but not broken
+    - This is the IDEAL outcome for a challenging benchmark task
+
+ 5. **Mix of GOOD_FAILURE and scattered BAD_FAILURE** → is_good=true, medium confidence
+    - Isolated BAD_FAILURE classifications may be noise/over-classification
+    - Default to trusting the task unless the pattern is clear
+
+ 6. **Any successes** → is_good=true, medium-high confidence
+    - Task is clearly solvable
+
+ ## Primary Issue Format
+
+ If is_good=false, write a clear primary issue:
+ - For a consistent problem: "{{count}}/{{total}} trials show: {{most_common_subtype}} - {{brief_explanation}}"
+ - For diverse problems: "{{count}}/{{total}} trials show task issues: {{list_of_problems}}"
+ - For baseline: "CRITICAL: {{baseline_issue}}"
+
+ If is_good=true:
+ - null (or omit)
+
+ ## Recommendations
+
+ For BAD tasks, provide 3-5 **specific, actionable** recommendations:
+ - Prioritize issues that appeared in multiple trials
+ - Focus on fixes that address the root cause
+ - Be concrete (e.g., "Add file path to instruction.md line 15" not "improve instructions")
+ - Deduplicate similar recommendations across trials
+
+ For GOOD tasks, recommendations should be empty or limited to minor suggestions.
+
+ ## Output Format
+
+ Return ONLY valid JSON (no markdown, no code blocks, no explanation):
+
+ {{
+   "is_good": true/false,
+   "confidence": "high" | "medium" | "low",
+   "primary_issue": "clear description of main problem" or null,
+   "recommendations": ["specific fix 1", "specific fix 2", ...],
+   "reasoning": "1-2 sentence explanation of your verdict based on the trial pattern"
+ }}
+
+ ## Important Notes
+
+ - **Default to GOOD**: Tasks that passed baseline validation are presumed good. Failures are expected.
+ - **Consistency matters**: 1/4 trials showing an issue is noise; 3/4 showing the SAME issue is a signal
+ - **Baseline validation overrides**: Critical baseline failures always mean is_good=false
+ - **Be skeptical of BAD_FAILURE**: "Underspecified Instruction" is often over-diagnosed. Ask: could the agent have figured this out by exploring the codebase?
+ - **Low pass rates are GOOD**: A 25% pass rate on a hard task is ideal for benchmarking
+ - **Successes prove solvability**: Any success means the task CAN be solved with the given instruction
+
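The prompt above demands a strict JSON shape from the model. Not part of the package diff, but a minimal sketch of how a caller might parse and sanity-check that verdict JSON; the `parse_verdict` name and the sample payload are illustrative, not from swegen:

```python
import json

# Keys the prompt's Output Format section requires in every verdict.
REQUIRED_KEYS = {"is_good", "confidence", "primary_issue", "recommendations", "reasoning"}


def parse_verdict(raw: str) -> dict:
    """Parse the model's verdict JSON and check the shape the prompt demands."""
    verdict = json.loads(raw)
    missing = REQUIRED_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"verdict missing keys: {sorted(missing)}")
    if not isinstance(verdict["is_good"], bool):
        raise ValueError("is_good must be a boolean")
    if verdict["confidence"] not in ("high", "medium", "low"):
        raise ValueError("confidence must be high|medium|low")
    if verdict["primary_issue"] is not None and not isinstance(verdict["primary_issue"], str):
        raise ValueError("primary_issue must be a string or null")
    if not isinstance(verdict["recommendations"], list):
        raise ValueError("recommendations must be a list")
    return verdict


# Hypothetical model output for a hard-but-solvable task.
raw = (
    '{"is_good": true, "confidence": "medium", "primary_issue": null,'
    ' "recommendations": [],'
    ' "reasoning": "4/5 trials were GOOD_FAILURE; task is hard but solvable."}'
)
verdict = parse_verdict(raw)
```

Rejecting malformed verdicts early keeps the "default to GOOD" logic from silently consuming a truncated or non-JSON response.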
swegen/cli.py ADDED
@@ -0,0 +1,411 @@
+ from __future__ import annotations
+
+ import json
+ from importlib.metadata import PackageNotFoundError as _PkgNotFound
+ from importlib.metadata import version as _pkg_version
+ from pathlib import Path
+
+ import typer
+ from dotenv import load_dotenv
+ from harbor.models.environment_type import EnvironmentType
+ from rich.console import Console
+
+ from swegen.config import CreateConfig, FarmConfig
+ from swegen.create import MissingIssueError, TrivialPRError
+ from swegen.create.create import run_reversal
+ from swegen.farm import StreamFarmer
+ from swegen.analyze import AnalyzeArgs, run_analyze, TrialClassifier, write_trial_analysis_files
+ from swegen.tools.validate import ValidateArgs, run_validate
+ from swegen.tools.validate_utils import ValidationError
+
+ load_dotenv()
+
+ app = typer.Typer(no_args_is_help=True, add_completion=False, help="Task generation CLI")
+
+
+ @app.callback(invoke_without_command=True)
+ def _root(
+     version: bool = typer.Option(
+         False,
+         "--version",
+         "-V",
+         help="Show swegen version and exit",
+         is_eager=True,
+     ),
+ ) -> None:
+     if version:
+         try:
+             typer.echo(f"swegen {_pkg_version('swe-gen')}")
+         except _PkgNotFound:
+             typer.echo("swegen (version unknown)")
+         raise typer.Exit()
+
+
+ create_app = typer.Typer(
+     no_args_is_help=True,
+     invoke_without_command=True,
+     add_completion=False,
+     help="Create a Harbor task from a merged PR and validate",
+ )
+
+
+ @create_app.callback()
+ def create_cmd(
+     repo: str = typer.Option(..., help="GitHub repository (owner/repo or URL)"),
+     pr: int = typer.Option(..., help="PR number"),
+     output: Path = typer.Option(Path("tasks"), help="Output root", show_default=True),
+     cc_timeout: int = typer.Option(
+         3200, help="Timeout for CC session in seconds (~53 min default)", show_default=True
+     ),
+     validate: bool = typer.Option(
+         True, help="Run Harbor validations; --no-validate skips validation"
+     ),
+     force: bool = typer.Option(False, help="Bypass local dedupe and regenerate"),
+     state_dir: Path = typer.Option(
+         Path(".state"), help="Local dedupe state dir", show_default=True
+     ),
+     no_cache: bool = typer.Option(
+         False, "--no-cache", help="Disable reusing cached Dockerfiles/test.sh from previous tasks"
+     ),
+     require_minimum_difficulty: bool = typer.Option(
+         True,
+         help="Require minimum difficulty (3+ source files); --no-require-minimum-difficulty to skip this check",
+     ),
+     min_source_files: int = typer.Option(
+         3, help="Minimum number of source files required (tests excluded)", show_default=True
+     ),
+     max_source_files: int = typer.Option(
+         10,
+         help="Maximum number of source files to avoid large refactors (tests excluded)",
+         show_default=True,
+     ),
+     require_issue: bool = typer.Option(
+         True,
+         help="Require PR to have a linked issue (higher quality instructions); --no-require-issue uses PR body/title instead",
+     ),
+     allow_unmerged: bool = typer.Option(
+         False,
+         help="Allow processing unmerged PRs (for testing/preview); --allow-unmerged to enable",
+     ),
+     environment: str = typer.Option(
+         "docker",
+         "-e",
+         "--env",
+         help="Environment type for Harbor runs (docker|daytona|e2b|modal|runloop|gke)",
+         show_default=True,
+     ),
+     verbose: bool = typer.Option(False, "-v", "--verbose", help="Increase output verbosity"),
+     quiet: bool = typer.Option(False, "-q", "--quiet", help="Reduce output verbosity"),
+ ) -> None:
+     config = CreateConfig(
+         repo=repo,
+         pr=pr,
+         output=output,
+         cc_timeout=cc_timeout,
+         validate=validate,
+         force=force,
+         state_dir=state_dir,
+         use_cache=not no_cache,
+         require_minimum_difficulty=require_minimum_difficulty,
+         min_source_files=min_source_files,
+         max_source_files=max_source_files,
+         require_issue=require_issue,
+         allow_unmerged=allow_unmerged,
+         environment=EnvironmentType(environment),
+         verbose=verbose,
+         quiet=quiet,
+     )
+     try:
+         run_reversal(config)
+     except (TrivialPRError, MissingIssueError, ValidationError, FileExistsError) as err:
+         # These exceptions have already displayed user-friendly messages.
+         # Exit with an error code but don't show a traceback.
+         raise SystemExit(1) from err
+
+
+ app.add_typer(create_app, name="create")
+
+
+ @app.command(help="Validate an existing Harbor task by running NOP and ORACLE")
+ def validate(
+     path: Path = typer.Argument(
+         ...,
+         help="Path to Harbor dataset root, specific task directory, or task ID when used with dataset root",
+     ),
+     task: str
+     | None = typer.Option(None, "--task", "-t", help="Task ID when --path points to dataset root"),
+     agent: str = typer.Option("both", help="Agent to run: both|nop|oracle", show_default=True),
+     jobs_dir: Path = typer.Option(
+         Path(".state/harbor-jobs"),
+         help="Directory to store Harbor job artifacts",
+         show_default=True,
+     ),
+     timeout_multiplier: float
+     | None = typer.Option(None, help="Multiply default timeouts (e.g., 3.0)"),
+     environment: str = typer.Option(
+         "docker",
+         "-e",
+         "--env",
+         help="Environment type for Harbor runs (docker|daytona|e2b|modal|runloop|gke)",
+         show_default=True,
+     ),
+     verbose: bool = typer.Option(False, "-v", "--verbose", help="Increase output verbosity"),
+     quiet: bool = typer.Option(False, "-q", "--quiet", help="Reduce output verbosity"),
+     max_parallel: int = typer.Option(
+         8, help="Maximum number of parallel validations (batch mode only)", show_default=True
+     ),
+     show_passed: bool = typer.Option(
+         False,
+         "--show-passed",
+         help="Show passed tasks in output (batch mode: default shows only failures)",
+     ),
+     output: Path
+     | None = typer.Option(
+         None, "-o", "--output", help="Write results to file as they complete (batch mode only)"
+     ),
+     docker_prune_batch: int = typer.Option(
+         5,
+         help="Run docker cleanup after every N tasks (0 to disable, local docker only)",
+         show_default=True,
+     ),
+ ) -> None:
+     if agent not in ("both", "nop", "oracle"):
+         raise typer.BadParameter("agent must be one of: both, nop, oracle")
+     run_validate(
+         ValidateArgs(
+             path=path,
+             task=task,
+             jobs_dir=jobs_dir,
+             agent=agent,
+             timeout_multiplier=timeout_multiplier,
+             verbose=verbose,
+             quiet=quiet,
+             environment=EnvironmentType(environment),
+             max_parallel=max_parallel,
+             show_passed=show_passed,
+             output_file=output,
+             docker_prune_batch=docker_prune_batch,
+         )
+     )
+
+
+ @app.command(help="Analyze a task by running agent trials and classifying outcomes")
+ def analyze(
+     path: Path = typer.Argument(..., help="Path to the task directory to analyze"),
+     agent: str = typer.Option(
+         "claude-code", "-a", "--agent", help="Agent to run trials with", show_default=True
+     ),
+     model: str = typer.Option(
+         "anthropic/claude-sonnet-4-5",
+         "-m",
+         "--model",
+         help="Model to use for agent trials",
+         show_default=True,
+     ),
+     n_trials: int = typer.Option(
+         3, "-k", "--n-trials", help="Number of trials to run", show_default=True
+     ),
+     n_concurrent: int = typer.Option(
+         3,
+         "-n",
+         "--n-concurrent",
+         help="Number of concurrent trials (1=sequential, 3-5 recommended)",
+         show_default=True,
+     ),
+     jobs_dir: Path = typer.Option(
+         Path(".state/analyze-jobs"),
+         "--jobs-dir",
+         help="Directory to store job artifacts",
+         show_default=True,
+     ),
+     skip_quality_check: bool = typer.Option(
+         False, "--skip-quality-check", help="Skip static quality check"
+     ),
+     skip_baseline: bool = typer.Option(
+         False, "--skip-baseline", help="Skip baseline validation (nop/oracle)"
+     ),
+     skip_classify: bool = typer.Option(
+         False, "--skip-classify", help="Skip LLM classification of trial outcomes"
+     ),
+     analysis_model: str = typer.Option(
+         "claude-sonnet-4-5",
+         "--analysis-model",
+         help="Model for Claude Code classification",
+         show_default=True,
+     ),
+     timeout_multiplier: float = typer.Option(
+         1.0, "--timeout-multiplier", help="Multiply default timeouts", show_default=True
+     ),
+     environment: str = typer.Option(
+         "docker",
+         "-e",
+         "--env",
+         help="Environment type for Harbor runs (docker|daytona|e2b|modal|runloop|gke)",
+         show_default=True,
+     ),
+     verbose: bool = typer.Option(False, "-v", "--verbose", help="Increase output verbosity"),
+     classification_timeout: int = typer.Option(
+         300,
+         "--classification-timeout",
+         help="Timeout per trial classification in seconds",
+         show_default=True,
+     ),
+     verdict_timeout: int = typer.Option(
+         180,
+         "--verdict-timeout",
+         help="Timeout for verdict synthesis in seconds",
+         show_default=True,
+     ),
+     save_to_dir: bool = typer.Option(
+         False,
+         "--save-to-dir",
+         help="Write trajectory-analysis.{md,json} to each trial directory",
+     ),
+ ) -> None:
+     """
+     Analyze a Harbor task to determine if it's well-specified.
+
+     This command classifies trial outcomes to identify TASK PROBLEMS vs AGENT PROBLEMS:
+
+     1. Static quality check (Harbor's tasks check)
+     2. Baseline validation (nop should fail, oracle should pass)
+     3. Run N agent trials (default: 3 with Claude Code)
+     4. Classify each trial outcome:
+        - GOOD_SUCCESS: Agent solved it correctly
+        - BAD_SUCCESS: Agent cheated or tests too permissive
+        - GOOD_FAILURE: Agent failed due to its own limitations
+        - BAD_FAILURE: Agent failed due to task issues
+        - HARNESS_ERROR: Infrastructure problem
+     5. Compute task verdict with recommendations
+
+     The goal is to identify tasks that need fixing before release.
+
+     Flags match Harbor CLI conventions:
+         -k / --n-trials: Total number of trials to run
+         -n / --n-concurrent: Number of trials to run concurrently (parallelism)
+
+     Examples:
+         # Sequential (default)
+         swegen analyze tasks/my-task -k 5
+
+         # Parallel (3 trials at once)
+         swegen analyze tasks/my-task -k 10 -n 3
+     """
+     run_analyze(
+         AnalyzeArgs(
+             task_path=path,
+             agent=agent,
+             model=model,
+             n_trials=n_trials,
+             n_concurrent=n_concurrent,
+             jobs_dir=jobs_dir,
+             skip_quality_check=skip_quality_check,
+             skip_baseline=skip_baseline,
+             skip_classify=skip_classify,
+             analysis_model=analysis_model,
+             environment=environment,
+             timeout_multiplier=timeout_multiplier,
+             verbose=verbose,
+             classification_timeout=classification_timeout,
+             verdict_timeout=verdict_timeout,
+             save_to_dir=save_to_dir,
+         )
+     )
+
+
+ @app.command(help="Continuous PR farming - stream through entire PR history")
+ def farm(
+     repo: str = typer.Argument(
+         ..., help="GitHub repository in owner/name format (e.g., fastapi/fastapi)"
+     ),
+     output: Path = typer.Option(
+         Path("tasks"), help="Output directory for generated tasks", show_default=True
+     ),
+     state_dir: Path = typer.Option(
+         Path(".state"), help="State directory for cache/logs", show_default=True
+     ),
+     force: bool = typer.Option(True, help="Regenerate even if task already exists"),
+     timeout: int = typer.Option(300, help="Timeout per PR in seconds", show_default=True),
+     cc_timeout: int = typer.Option(
+         3200, help="Timeout for Claude Code session in seconds (~53 min default)", show_default=True
+     ),
+     api_delay: float = typer.Option(
+         0.5, help="Delay between GitHub API calls in seconds", show_default=True
+     ),
+     task_delay: int = typer.Option(60, help="Delay between tasks in seconds", show_default=True),
+     reset: bool = typer.Option(False, "--reset", help="Reset state and start from beginning"),
+     resume_from: str
+     | None = typer.Option(
+         None, help="Resume from date (e.g., '2024-01-15' or '2024-01-15T10:30:00Z')"
+     ),
+     dry_run: bool = typer.Option(
+         False, "--dry-run", help="Only show what would run (no task generation)"
+     ),
+     docker_prune_batch: int = typer.Option(
+         5, help="Run docker cleanup after every N PRs (0 to disable)", show_default=True
+     ),
+     skip_list: str
+     | None = typer.Option(None, help="Path to file with task IDs to skip (one per line)"),
+     no_cache: bool = typer.Option(
+         False, "--no-cache", help="Disable reusing cached Dockerfiles/test.sh"
+     ),
+     require_minimum_difficulty: bool = typer.Option(
+         True,
+         help="Require minimum difficulty (3+ source files); --no-require-minimum-difficulty to skip this check",
+     ),
+     min_source_files: int = typer.Option(
+         3, help="Minimum number of source files required (tests excluded)", show_default=True
+     ),
+     max_source_files: int = typer.Option(
+         10,
+         help="Maximum number of source files to avoid large refactors (tests excluded)",
+         show_default=True,
+     ),
+     environment: str = typer.Option(
+         "docker",
+         "-e",
+         "--env",
+         help="Environment type for Harbor runs (docker|daytona|e2b|modal|runloop|gke)",
+         show_default=True,
+     ),
+     verbose: bool = typer.Option(False, "-v", "--verbose", help="Enable verbose output"),
+     issue_only: bool = typer.Option(
+         True,
+         "--issue-only",
+         help="Only process PRs with linked issues (higher quality instructions)",
+     ),
+     validate: bool = typer.Option(
+         True, help="Run Harbor validation after CC; --no-validate to skip"
+     ),
+ ) -> None:
+     """
+     Continuously process merged GitHub PRs and convert them to Harbor tasks.
+
+     Streams PRs page-by-page, processes them immediately, and maintains state
+     for resumable operation. Uses a language-agnostic pipeline that works for
+     any repository.
+     """
+     config = FarmConfig(
+         repo=repo,
+         output=output,
+         state_dir=state_dir,
+         force=force,
+         timeout=timeout,
+         cc_timeout=cc_timeout,
+         api_delay=api_delay,
+         task_delay=task_delay,
+         reset=reset,
+         resume_from=resume_from,
+         dry_run=dry_run,
+         docker_prune_batch=docker_prune_batch,
+         skip_list=skip_list,
+         no_cache=no_cache,
+         require_minimum_difficulty=require_minimum_difficulty,
+         min_source_files=min_source_files,
+         max_source_files=max_source_files,
+         environment=EnvironmentType(environment),
+         verbose=verbose,
+         issue_only=issue_only,
+         validate=validate,
+     )
+
+     console = Console()
+     farmer = StreamFarmer(config.repo, config, console)
+     exit_code = farmer.run()
+     raise typer.Exit(code=exit_code)
swegen/config.py ADDED
@@ -0,0 +1,142 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+ from pathlib import Path
+ from typing import Literal
+
+ from harbor.models.environment_type import EnvironmentType
+
+
+ @dataclass(frozen=True)
+ class CreateConfig:
+     """Configuration for the create command (PR → Harbor task).
+
+     The create command uses a language-agnostic pipeline that works
+     for any repository. Claude Code analyzes the repo to detect language, runtime,
+     build system, and test framework automatically.
+
+     Attributes:
+         repo: GitHub repository in "owner/repo" format or full URL
+         pr: Pull request number
+         output: Output directory for generated tasks (default: tasks/)
+         cc_timeout: Timeout for Claude Code session in seconds
+         validate: Run Harbor validations (NOP + Oracle)
+         force: Bypass local dedupe and regenerate existing tasks
+         state_dir: Directory for local state/cache
+         use_cache: Reuse cached Dockerfiles/test.sh from previous tasks
+         require_minimum_difficulty: Require 3+ source files for task
+         min_source_files: Minimum number of source files required (default: 3)
+         max_source_files: Maximum number of source files allowed to avoid large refactors (default: 10)
+         require_issue: Require PR to have a linked issue (higher quality instructions)
+         allow_unmerged: Allow processing unmerged PRs (for testing/preview, default: False)
+         environment: Environment type for Harbor runs (docker, daytona, e2b, modal, runloop, gke)
+         verbose: Increase output verbosity
+         quiet: Reduce output verbosity
+     """
+
+     repo: str
+     pr: int
+     output: Path = field(default_factory=lambda: Path("tasks"))
+     cc_timeout: int = 3200
+     validate: bool = True
+     force: bool = False
+     state_dir: Path = field(default_factory=lambda: Path(".state"))
+     use_cache: bool = True
+     require_minimum_difficulty: bool = True
+     min_source_files: int = 3
+     max_source_files: int = 10
+     require_issue: bool = True
+     allow_unmerged: bool = False
+     environment: EnvironmentType = EnvironmentType.DOCKER
+     verbose: bool = False
+     quiet: bool = False
+
+     # Computed property for backward compatibility with old code
+     @property
+     def no_validate(self) -> bool:
+         """Inverse of validate for backward compatibility."""
+         return not self.validate
+
+
+ @dataclass(frozen=True)
+ class FarmConfig:
+     """Configuration for the farm command (continuous PR processing).
+
+     The farm command uses a language-agnostic pipeline that works
+     for any repository. Claude Code analyzes the repo to detect language, runtime,
+     build system, and test framework automatically.
+
+     Attributes:
+         repo: GitHub repository in "owner/repo" format
+         output: Output directory for generated tasks (default: tasks/)
+         state_dir: Directory for local state/cache
+         force: Regenerate even if task already exists
+         timeout: Timeout per PR in seconds
+         cc_timeout: Timeout for Claude Code session in seconds
+         api_delay: Delay between GitHub API calls in seconds
+         task_delay: Delay between tasks in seconds
+         reset: Reset state and start from beginning
+         resume_from: Resume from date (ISO format or YYYY-MM-DD)
+         dry_run: Only show what would run (no task generation)
+         docker_prune_batch: Run docker cleanup after every N PRs (0 to disable)
+         skip_list: Path to file with task IDs to skip
+         no_cache: Disable reusing cached Dockerfiles/test.sh
+         require_minimum_difficulty: Require 3+ source files for task
+         min_source_files: Minimum number of source files required (default: 3)
+         max_source_files: Maximum number of source files allowed to avoid large refactors (default: 10)
+         environment: Environment type for Harbor runs (docker, daytona, e2b, modal, runloop, gke)
+         verbose: Enable verbose output
+         issue_only: Only process PRs that have linked issues (higher quality instructions)
+         validate: Run Harbor validation after CC (useful when CC times out but task may be valid)
+     """
+
+     repo: str
+     output: Path = field(default_factory=lambda: Path("tasks"))
+     state_dir: Path = field(default_factory=lambda: Path(".state"))
+     force: bool = True
+     timeout: int = 300
+     cc_timeout: int = 900
+     api_delay: float = 0.5
+     task_delay: int = 60
+     reset: bool = False
+     resume_from: str | None = None
+     dry_run: bool = False
+     docker_prune_batch: int = 5
+     skip_list: str | None = None
+     no_cache: bool = False
+     require_minimum_difficulty: bool = True
+     min_source_files: int = 3
+     max_source_files: int = 10
+     environment: EnvironmentType = EnvironmentType.DOCKER
+     verbose: bool = False
+     issue_only: bool = False
+     validate: bool = True
+
+
+ @dataclass(frozen=True)
+ class ValidateConfig:
+     """Configuration for the validate command.
+
+     Attributes:
+         path: Path to Harbor dataset root or specific task directory
+         task: Task ID when path points to dataset root
+         agent: Agent to run: both, nop, or oracle
+         jobs_dir: Directory to store Harbor job artifacts
+         timeout_multiplier: Multiply default timeouts
+         environment: Environment type for Harbor runs (docker, daytona, e2b, modal, runloop, gke)
+         verbose: Increase output verbosity
+         quiet: Reduce output verbosity
+         max_parallel: Maximum number of parallel validations (batch mode)
+         show_passed: Show passed tasks in output (batch mode)
+     """
+
+     path: Path
+     task: str | None = None
+     agent: Literal["both", "nop", "oracle"] = "both"
+     jobs_dir: Path = field(default_factory=lambda: Path(".state/harbor-jobs"))
+     timeout_multiplier: float | None = None
+     environment: EnvironmentType = EnvironmentType.DOCKER
+     verbose: bool = False
+     quiet: bool = False
+     max_parallel: int = 8
+     show_passed: bool = False
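
CreateConfig's `no_validate` property inverts the stored flag so older call sites keep working after the rename to `validate`. A self-contained sketch of the same frozen-dataclass pattern; the `LegacyConfig` name is illustrative, not from the package:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegacyConfig:
    # New-style flag stored directly; frozen=True makes instances immutable,
    # so the derived value can never drift out of sync with the stored one.
    validate: bool = True

    @property
    def no_validate(self) -> bool:
        """Inverse of validate, kept for backward compatibility."""
        return not self.validate


cfg = LegacyConfig(validate=False)
```

Because the old name is a computed property rather than a second field, there is only one source of truth, and attempts to assign either attribute on a frozen instance raise `FrozenInstanceError`.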