PyPI - codeprobe - Versions diffs - 0.5.4__tar.gz → 0.7.0__tar.gz - Mend

codeprobe 0.5.4.tar.gz → 0.7.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (324)

codeprobe-0.7.0/.claude/skills/acceptance-loop/SKILL.md ADDED

@@ -0,0 +1,278 @@
+---
+name: acceptance-loop
+description: Orchestrate the continuous Test→Verify→Fix→Release acceptance loop for codeprobe. Spawns a Test Agent to produce a workspace, runs the Verifier to produce verdict.json, feeds verdicts into the convergence controller, spawns a Fix Agent when failures remain, runs the regression gate after every fix, and promotes to the release gate after two consecutive green verdicts. Triggers on acceptance loop, convergence loop, test verify fix, /acceptance-loop.
+user-invocable: false
+---
+# Acceptance Loop: Continuous Test→Verify→Fix→Release
+## Purpose
+Drive codeprobe toward a releasable state by repeatedly spawning a Test Agent to exercise the tool, running the behavioral Verifier against the produced workspace, and spawning a Fix Agent when the verdict contains failures. Every fix is gated by `acceptance/regression.py` (pytest + ruff + mypy with auto-revert), every verdict is fed into `acceptance/converge.py` for a deterministic CONTINUE / HALT / RELEASE / ESCALATE decision, and release promotion is gated by `acceptance/release.py` (wheel build + staged smoke test + version bump + tag). The loop is ZFC-compliant: every policy decision is derived from structured data, not from model judgment.
+This SKILL is the single entry point. Sub-skills for spawning each agent live in [`test-agent.md`](./test-agent.md) and [`fix-agent.md`](./fix-agent.md) — do **not** inline their prompts here; read them from disk and substitute parameters.
+---
+## Parameters
+| Name | Required | Default | Description |
+|------|----------|---------|-------------|
+| `target_repo` | yes | — | Absolute path to the frozen test repo the Test Agent exercises. |
+| `pinned_sha` | yes | — | Expected git SHA of `target_repo`. Mismatch halts the loop before iteration 1. |
+| `max_iterations` | no | `5` | Hard cap on loop iterations. Passed to `ConvergenceController(max_iterations=...)`. |
+| `eval_mode` | no | `dry-run` | `dry-run` (no agent calls) or `real` (cost-bounded). Forwarded to the Test Agent. |
+| `repo_root` | no | `/home/ds/projects/codeprobe` | codeprobe repo the Fix Agent edits. Also the regression-gate target. |
+Reject the invocation if `target_repo` or `pinned_sha` is missing. There is no interactive prompting; this skill assumes it is invoked programmatically by `/acceptance-loop` with fully-bound parameters.
+---
+## Phase 0: Configure
+### 0.1 Parse and validate parameters
+Bind the parameters above into shell variables (`TARGET_REPO`, `PINNED_SHA`, `MAX_ITERATIONS`, `EVAL_MODE`, `REPO_ROOT`). Fail fast with a `FAILURE: <reason>` line if any required value is missing or non-absolute.
+### 0.2 Stale workspace cleanup
+Remove any `/tmp/codeprobe-loop-*` directory older than 24 hours so long-running sessions don't fill `/tmp`:
+```bash
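+# -mmin +1440 matches directories older than 24 hours; -mtime +1 would only match those at least 48 hours old.
+find /tmp -maxdepth 1 -type d -name 'codeprobe-loop-*' -mmin +1440 -print -exec rm -rf {} +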
+```
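The same age filter can be sketched in Python, which makes the 24-hour cutoff explicit. This is only an illustration: `find_stale_workspaces` is a hypothetical helper, not part of codeprobe, and it only lists candidates rather than deleting them.

```python
import time
from pathlib import Path
from typing import List, Optional


def find_stale_workspaces(
    root: Path,
    max_age_seconds: float = 24 * 3600,
    now: Optional[float] = None,
) -> List[Path]:
    """Return codeprobe-loop-* directories directly under root older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(root.glob("codeprobe-loop-*")):
        # Only directories count; age is measured from the last modification time.
        if entry.is_dir() and now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry)
    return stale
```

Calling `find_stale_workspaces(Path("/tmp"))` would return the directories the cleanup step is meant to remove; deletion is left to the caller.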
+### 0.3 Disk space pre-check
+Refuse to start if `/tmp` has less than 2 GB free — the Test Agent captures full CLI output and the wheel staging step creates a venv:
+```bash
+FREE_KB=$(df -Pk /tmp | awk 'NR==2 {print $4}')
+if [ "$FREE_KB" -lt 2097152 ]; then
+  echo "FAILURE: /tmp has <2GB free ($FREE_KB KB); aborting acceptance loop"
+  exit 1
+fi
+```
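For callers that drive the loop from Python rather than a shell, the same threshold can be checked with the standard library. This is a sketch under the assumption that the 2 GB floor is the only condition; `has_enough_free_space` is a hypothetical name, not a codeprobe function.

```python
import shutil

# 2 GB expressed in bytes; equivalent to the 2097152 KB threshold in the shell check.
MIN_FREE_BYTES = 2 * 1024**3


def has_enough_free_space(path: str = "/tmp", min_free_bytes: int = MIN_FREE_BYTES) -> bool:
    """Return True when the filesystem holding path has at least min_free_bytes available."""
    return shutil.disk_usage(path).free >= min_free_bytes
```

A caller would abort with the same `FAILURE:` line when the function returns `False`.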
+### 0.4 Concurrent-run lock (git tag)
+Use a local-only git tag `codeprobe-loop-running` as a mutex. Stale locks older than 4 hours are auto-removed; fresh locks block. The tag is removed in the Cleanup section regardless of how the loop exits:
+```bash
+cd "$REPO_ROOT"
+if git rev-parse -q --verify refs/tags/codeprobe-loop-running >/dev/null; then
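+  # Read the tag's own creation time. This needs an annotated tag: a lightweight
+  # tag would report the age of the tagged commit, not the age of the lock.
+  LOCK_EPOCH=$(git for-each-ref --format='%(creatordate:unix)' refs/tags/codeprobe-loop-running)
+  AGE=$(( $(date +%s) - ${LOCK_EPOCH:-0} ))
+  if [ "$AGE" -gt 14400 ]; then
+    git tag -d codeprobe-loop-running
+  else
+    echo "FAILURE: another acceptance-loop run holds codeprobe-loop-running (age ${AGE}s)"
+    exit 1
+  fi
+fi
+# Annotated tag so the lock carries its own timestamp.
+git tag -a -m "acceptance-loop lock" codeprobe-loop-running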
+```
+### 0.5 Loop workspace root
+```bash
+LOOP_ROOT=/tmp/codeprobe-loop-$(date +%Y%m%d-%H%M%S)
+mkdir -p "$LOOP_ROOT"
+CONVERGE_DB="$LOOP_ROOT/converge.db"
+VERDICT_HISTORY=()
+```
+Each iteration gets its own subdirectory `$LOOP_ROOT/iter-<N>/` that the Test Agent uses as its workspace and that holds that iteration's `verdict.json`.
+---
+## Phase 1: Test & Verify (per iteration)
+For each `ITER` in `1..MAX_ITERATIONS`:
+### 1.1 Per-iteration workspace
+```bash
+WORKSPACE="$LOOP_ROOT/iter-$ITER"
+mkdir -p "$WORKSPACE"
+```
+### 1.2 Spawn the Test Agent sub-agent
+#### 1.2a Compile criterion-driven actions
+Before reading the Test Agent prompt, compile the Phase 5 pipeline steps from `acceptance/criteria.toml`. This replaces the old hardcoded 5a-5e pipeline with one step per criterion that the Verifier can actually check:
+```bash
+COMPILED_ACTIONS=$(python3 -c "
+import pathlib
+from acceptance.loader import load_criteria
+from codeprobe.acceptance_compiler import compile_actions
+criteria = load_criteria()
+actions = compile_actions(
+    criteria,
+    target_repo=pathlib.Path('$TARGET_REPO'),
+    workspace=pathlib.Path('$WORKSPACE'),
+    project_root=pathlib.Path('$REPO_ROOT'),
+)
+for i, a in enumerate(actions, start=1):
+    print(f'### 5.{i}. {a.description}')
+    print()
+    print('\`\`\`')
+    print(a.shell_snippet)
+    print('\`\`\`')
+    print()
+")
+```
+#### 1.2b Bind and spawn
+Read `./.claude/skills/acceptance-loop/test-agent.md`, substitute the five `{{PARAM}}` tokens (`{{ITERATION}}`, `{{TARGET_REPO}}`, `{{PINNED_SHA}}`, `{{EVAL_MODE}}`, `{{COMPILED_ACTIONS}}`), and hand the bound prompt to a `general-purpose` sub-agent via the Agent tool. Also pass `{{WORKSPACE}} = $WORKSPACE` if the sub-skill references it.
+Wait for the sub-agent to exit. It MUST produce `$WORKSPACE/workspace-manifest.json`. If the manifest is missing, jump to the ESCALATE handler with reason `test_agent_no_manifest`.
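The token substitution described above could look like the following sketch. It assumes tokens are upper-case names wrapped in double braces, as in the list above; `bind_prompt` is a hypothetical helper and failing on unbound tokens is a design choice of this example, not documented behavior.

```python
import re
from typing import Dict

# Matches tokens such as {{ITERATION}} or {{TARGET_REPO}}.
TOKEN = re.compile(r"\{\{([A-Z_]+)\}\}")


def bind_prompt(template: str, params: Dict[str, str]) -> str:
    """Replace every {{NAME}} token with params[NAME]; raise if a token has no value."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"unbound template token: {name}")
        return params[name]

    return TOKEN.sub(substitute, template)
```

Raising on an unbound token keeps a half-substituted prompt from ever reaching the sub-agent.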
+### 1.3 Run the Verifier
+The verifier has no argparse CLI, so drive it via a one-shot `python3 -c` that imports `Verifier`, runs it, and writes the verdict to `$WORKSPACE/verdict.json`:
+```bash
+python3 -c "
+import pathlib
+from acceptance.verify import Verifier
+v = Verifier(pathlib.Path('$REPO_ROOT/acceptance/criteria.toml'),
+             project_root=pathlib.Path('$REPO_ROOT'))
+verdict = v.run(pathlib.Path('$WORKSPACE'), iteration=$ITER)
+v.write_verdict(verdict, pathlib.Path('$WORKSPACE/verdict.json'))
+" || { echo 'FAILURE: verifier crashed'; exit 3; }
+VERDICT_HISTORY+=("$WORKSPACE/verdict.json")
+```
+### 1.4 Record the verdict with the convergence controller
+```bash
+python3 -c "
+import json, pathlib
+from acceptance.converge import ConvergenceController
+cc = ConvergenceController(pathlib.Path('$CONVERGE_DB'), max_iterations=$MAX_ITERATIONS)
+cc.record_verdict(json.loads(pathlib.Path('$WORKSPACE/verdict.json').read_text()))
+"
+```
+### 1.5 Ask for the decision
+```bash
+DECISION=$(python3 -c "
+import pathlib
+from acceptance.converge import ConvergenceController
+cc = ConvergenceController(pathlib.Path('$CONVERGE_DB'), max_iterations=$MAX_ITERATIONS)
+print(cc.decide().decision.value)
+")
+```
+Branch on `$DECISION`:
+- `release` → jump to **Phase 3: Release**.
+- `continue` → proceed to **Phase 2: Fix** (if the verdict has failures) or loop back to 1.1 with `ITER++`.
+- `halt_max_iterations` | `halt_regression` | `halt_stuck` | `escalate` → jump to **Halt Conditions**.
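The branching above can be summarized as a small routing function. This is a sketch only: the decision strings come from the list above, while `next_step` and its return labels are hypothetical names chosen for illustration.

```python
HALT_DECISIONS = {"halt_max_iterations", "halt_regression", "halt_stuck", "escalate"}


def next_step(decision: str, fail_count: int) -> str:
    """Map a controller decision plus the latest fail_count to the next phase of the loop."""
    if decision == "release":
        return "phase-3-release"
    if decision == "continue":
        # Failures remain: fix first. All green: wait for the second green verdict.
        return "phase-2-fix" if fail_count > 0 else "next-iteration"
    if decision in HALT_DECISIONS:
        return "halt"
    raise ValueError(f"unknown decision: {decision!r}")
```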
+---
+## Phase 2: Fix (conditional)
+Only entered when `$DECISION == continue` AND the verdict has `fail_count > 0`. If `fail_count == 0` but the controller still says `continue`, skip directly to the next iteration (the loop is waiting for the second green in a row).
+### 2.1 Spawn the Fix Agent sub-agent
+Read `./.claude/skills/acceptance-loop/fix-agent.md`, substitute its parameters (`{{ITERATION}}`, `{{REPO_ROOT}}`, `{{VERDICT_PATH}}`), and hand the bound prompt to a fresh `general-purpose` sub-agent. The Fix Agent is contractually constrained to produce exactly ONE commit or print `FAILURE: <criterion_id>` on stdout.
+### 2.2 Regression gate after every fix
+Run the regression gate against `$REPO_ROOT`. It runs pytest, ruff, and mypy, and auto-reverts HEAD on failure:
+```bash
+python3 -m acceptance.regression --repo-root "$REPO_ROOT"
+RC=$?
+if [ $RC -ne 0 ]; then
+  echo "regression gate FAILED at iteration $ITER — commit reverted"
+fi
+```
+A regression-gate failure is **not** an automatic halt — the Test Agent re-runs on the reverted tree in the next iteration. The convergence controller halts the loop on its own via `HALT_REGRESSION` if `pass_count` drops between consecutive verdicts.
+### 2.3 Loop
+`ITER=$((ITER+1))` and jump back to **Phase 1: Test & Verify**. Do not clear `$CONVERGE_DB` — it is the source of truth for the two-green-in-a-row release check.
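To make the convergence rules concrete, here is a simplified model of the decisions this skill describes: release after two consecutive green verdicts, halt on a `pass_count` drop, halt at the iteration cap. It is not the code in `acceptance/converge.py`; the function name, the order in which the rules are checked, and the omission of the three-strike rule are all assumptions of this sketch.

```python
from typing import Dict, List


def decide(history: List[Dict[str, int]], max_iterations: int = 5) -> str:
    """Return a decision string for a list of verdicts, oldest first.

    Each verdict is modeled as a dict with pass_count and fail_count only.
    """
    if not history:
        return "continue"
    latest = history[-1]
    previous = history[-2] if len(history) >= 2 else None
    # Two consecutive verdicts without failures promote to release.
    if previous is not None and latest["fail_count"] == 0 and previous["fail_count"] == 0:
        return "release"
    # A drop in pass_count between consecutive verdicts is a regression.
    if previous is not None and latest["pass_count"] < previous["pass_count"]:
        return "halt_regression"
    if len(history) >= max_iterations:
        return "halt_max_iterations"
    return "continue"
```

Because the decision depends on the whole verdict history, the history must survive across iterations, which is why the database is never cleared between them.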
+---
+## Phase 3: Release (conditional)
+Entered exactly once when `cc.decide().decision == Decision.RELEASE`. Release is all-or-nothing: any sub-step failure aborts with an escalation report, leaves the lock tag in place until Cleanup, and returns non-zero.
+```bash
+python3 -c "
+import pathlib, sys
+from acceptance.release import ReleaseGate
+gate = ReleaseGate(pathlib.Path('$REPO_ROOT'))
+verdicts = [pathlib.Path(p) for p in '''${VERDICT_HISTORY[@]}'''.split()]
+if not gate.check_ready(verdicts):
+    print('FAILURE: release gate refused — verdict history not ready'); sys.exit(2)
+staging = gate.build_and_stage()
+if staging.error:
+    print(f'FAILURE: staging failed — {staging.error}'); sys.exit(3)
+new_version = gate.bump_version('patch')
+tag = gate.prepare_tag(new_version)
+print(f'RELEASE_READY version={new_version} tag={tag}')
+" || { echo 'release gate failed'; exit 4; }
+```
+Show the user the `RELEASE_READY` line plus the staged wheel path. The actual `git push --tags` is a human action — this loop stops at "tag prepared locally".
+---
+## Halt Conditions
+When `cc.decide()` returns a non-CONTINUE/non-RELEASE decision, render and surface the escalation report before cleaning up:
+```bash
+python3 -c "
+import pathlib
+from acceptance.converge import ConvergenceController
+cc = ConvergenceController(pathlib.Path('$CONVERGE_DB'), max_iterations=$MAX_ITERATIONS)
+print(cc.get_escalation_report())
+" > "$LOOP_ROOT/escalation.md"
+cat "$LOOP_ROOT/escalation.md"
+```
+Decision-specific user messaging:
+- **`halt_max_iterations`** — Loop cap hit without reaching two-green. Report iteration count, latest `pass_count/fail_count`, and the escalation markdown. Exit code 10.
+- **`halt_regression`** — A fix made things worse (`pass_count` dropped). Report the regression delta from the decision context. The offending commit was already reverted by the regression gate; point the user at `$LOOP_ROOT/iter-<N>/verdict.json`. Exit code 11.
+- **`halt_stuck`** / **`escalate`** — Three-strike rule triggered: same criterion failed 3 iterations in a row with identical evidence. Report the stuck criterion IDs, their evidence, and recommend human investigation. Exit code 12.
+In every halt path, preserve `$LOOP_ROOT` for post-mortem — do NOT delete it in Cleanup.
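The exit codes listed above can be kept in one lookup table so the messaging and the process exit status cannot drift apart. A minimal sketch, assuming the decision strings are exactly those printed in step 1.5; `halt_exit_code` is a hypothetical helper.

```python
# Exit codes as documented in the halt conditions above.
HALT_EXIT_CODES = {
    "halt_max_iterations": 10,
    "halt_regression": 11,
    "halt_stuck": 12,
    "escalate": 12,
}


def halt_exit_code(decision: str) -> int:
    """Return the process exit code for a halt decision; reject anything else."""
    try:
        return HALT_EXIT_CODES[decision]
    except KeyError:
        raise ValueError(f"not a halt decision: {decision!r}") from None
```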
+---
+## Cleanup
+Always executed, even on failure, via a `trap` at the top of the loop or an explicit final block:
+1. Remove the concurrent-run lock: `cd "$REPO_ROOT" && git tag -d codeprobe-loop-running 2>/dev/null || true`.
+2. On RELEASE or max-iterations-green paths, optionally prune `$LOOP_ROOT` — otherwise preserve it and print the path so the user can inspect iteration workspaces and `escalation.md`.
+3. Print a one-line summary: `acceptance-loop done: iterations=N decision=$DECISION workspace=$LOOP_ROOT`.
+---
+## References
+- `acceptance/criteria.toml` — 25 seed criteria in TOML.
+- `acceptance/loader.py::load_criteria()` — parsed into `Criterion` objects.
+- `src/codeprobe/acceptance_compiler.py::compile_actions()` — compiles criteria into Test Agent shell actions.
+- `acceptance/verify.py::Verifier.run()` / `.write_verdict()` — produces `verdict.json`.
+- `acceptance/converge.py::ConvergenceController` — `record_verdict`, `decide`, `is_release_ready`, `get_escalation_report`.
+- `acceptance/regression.py` — `python3 -m acceptance.regression --repo-root <path>`.
+- `acceptance/release.py::ReleaseGate` — `check_ready`, `build_and_stage`, `bump_version`, `prepare_tag`.
+- [`test-agent.md`](./test-agent.md) — Test Agent sub-skill prompt (do not inline).
+- [`fix-agent.md`](./fix-agent.md) — Fix Agent sub-skill prompt (do not inline).

codeprobe-0.7.0/.claude/skills/assess-codebase/SKILL.md ADDED

@@ -0,0 +1,95 @@
+---
+name: assess-codebase
+description: Assess a codebase for AI agent benchmarking potential. Analyzes repo structure, complexity, and history to estimate how well-suited it is for meaningful agent evaluation. Triggers on assess codebase, codebase assessment, evaluate codebase, codebase readiness, benchmark potential.
+user-invocable: true
+---
+# Assess Codebase
+Analyze a codebase to determine how well-suited it is for meaningful AI agent benchmarking. Produces a readiness report covering repo structure, complexity, history depth, test infrastructure, and task mining potential.
+Invokes `codeprobe assess` under the hood -- all analysis runs through the CLI, not Python imports.
+---
+## Phase 0: Assessment Goals
+Ask the user:
+**Question 1** -- Header: "Target codebase"
+- Question: "Which codebase should I assess?"
+- Options:
+  - **Current directory** -- "Assess the repo in the current working directory"
+  - **Specific path** -- "I'll provide a path to a local repo"
+If **Current directory**, set `REPO_PATH=.`.
+If **Specific path**, prompt for the absolute path and set `REPO_PATH={user_input}`.
+### Validate Path
+Before proceeding, confirm the path is a valid git repo:
+```bash
+git -C "{REPO_PATH}" rev-parse --git-dir 2>/dev/null && echo "valid" || echo "not a git repo"
+```
+If not a git repo, ask the user for a different path.
+---
+## Phase 1: Run Assessment
+Execute the codeprobe CLI:
+```bash
+codeprobe assess "{REPO_PATH}"
+```
+This analyzes:
+- Repository structure and size
+- Language distribution
+- Code complexity signals
+- Git history depth and merge activity
+- Test infrastructure coverage
+- Build system and CI presence
+---
+## Phase 2: Present Results
+Display the assessment output to the user. Highlight:
+1. **Benchmarking potential** -- Is this repo a good candidate for agent evaluation?
+2. **Task mining readiness** -- Does the repo have enough merge history and test coverage for `/mine-tasks`?
+3. **Key strengths** -- What makes this repo good for benchmarking (e.g., rich PR history, strong test suite)
+4. **Gaps** -- What's missing that would improve benchmarking quality (e.g., no CI, sparse test coverage)
+---
+## Phase 3: Next Steps
+Based on the assessment, suggest concrete follow-up actions:
+```
+Suggested next steps:
+  1. {If repo scores well}: Run `codeprobe mine {REPO_PATH}` to extract eval tasks
+     from merged PRs.
+  2. {If test coverage is low}: Consider adding tests before benchmarking --
+     agents can't be scored without a ground truth.
+  3. {If history is shallow}: The repo needs more merged PRs for meaningful
+     task mining. Consider using a more active repo.
+```
+---
+## Quick Reference
+| User says | What happens |
+|-----------|-------------|
+| `/assess-codebase` | Assess current directory |
+| `/assess-codebase /path/to/repo` | Assess specific repo |
+| "is this repo good for benchmarking?" | Same as `/assess-codebase` |
+| "evaluate my codebase" | Same as `/assess-codebase` |

codeprobe-0.7.0/.claude/skills/codeprobe-calibrate/SKILL.md ADDED

@@ -0,0 +1,87 @@
+---
+name: codeprobe-calibrate
+description: Run the codeprobe calibration gate and emit a curator profile when the R11 validity thresholds are met. Compares two curators over a holdout and enforces minimum tasks, minimum repos, and Pearson correlation before accepting. Triggers on calibrate curator, calibration gate, validity gate, curator profile, r11 gate, pearson correlation. Use this when a new curator version needs to be qualified before it is used in mining or scoring pipelines.
+user-invocable: false
+---
+# codeprobe calibrate (autonomous agent contract)
+Gate a curator version against a holdout set. A profile is emitted only when
+three validity conditions are met: holdout size, repo diversity, and Pearson
+correlation against the reference curator. Any failure exits non-zero without
+writing a profile.
+## Environment (pre-loaded)
+- !`codeprobe doctor --json`
+If doctor reports provider-related failures (e.g. `LLM_UNAVAILABLE`), calibrate
+will almost certainly fail as well. Resolve doctor first.
+## Bare invocation
+Minimum viable call. `--curator-version` is required:
+```bash
+codeprobe calibrate <holdout_path> --json --curator-version <id>
+```
+Emit the profile to a specific path:
+```bash
+codeprobe calibrate <holdout_path> --json --curator-version <id> --out <profile.json>
+```
+Adjust acceptance thresholds for an exploratory run. The defaults are the R11
+thresholds of 0.6 correlation / 100 tasks / 3 repos, which the example spells out explicitly; do NOT relax them in CI:
+```bash
+codeprobe calibrate <holdout_path> --json --curator-version <id> --threshold 0.6 --min-tasks 100 --min-repos 3
+```
+## JSON fields to parse
+```json
+{
+  "status": "ok" | "error",
+  "command": "calibrate",
+  "exit_code": 0,
+  "data": {
+    "curator_version": "...",
+    "holdout_tasks": <int>,
+    "holdout_repos": <int>,
+    "pearson_correlation": <float>,
+    "thresholds": { "min_tasks": <int>, "min_repos": <int>, "threshold": <float> },
+    "profile_path": "<abs-path | null>",
+    "passed": <bool>
+  },
+  "errors": [ { "code": "<CODE>", "message": "...", "remediation": "...", "terminal": <bool> } ]
+}
+```
+`profile_path` is `null` unless `passed == true`. A passed gate is the only
+condition under which any profile artifact exists.
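A consumer of this envelope might extract the profile path as in the sketch below. It relies only on the fields shown in the schema above; `accepted_profile_path` is a hypothetical helper, and treating any non-`ok` status as "no profile" is an assumption of this example.

```python
import json
from typing import Optional


def accepted_profile_path(envelope_text: str) -> Optional[str]:
    """Return data.profile_path when the gate passed, otherwise None."""
    envelope = json.loads(envelope_text)
    if envelope.get("status") != "ok":
        return None
    data = envelope.get("data") or {}
    # profile_path is only meaningful when passed is true.
    if data.get("passed") is True and data.get("profile_path"):
        return data["profile_path"]
    return None
```

The function never falls back to a default path, since a passed gate is the only case in which a profile exists.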
+## Error handling
+Only the codes below may surface. Cross-reference `src/codeprobe/cli/error_codes.json`.
+| Code | Kind | Retryable? | Action |
+|---|---|---|---|
+| CALIBRATION_REJECTED | diagnostic | no | Increase holdout size / repo diversity, or accept the curator is not qualified. Do not auto-retry with a lowered threshold — that defeats the gate. |
+| METADATA_INVALID | diagnostic | no | Holdout rows are malformed; fix data and re-run. |
+| METADATA_MISSING | diagnostic | no | Required metadata columns are missing from the holdout. |
+| LLM_UNAVAILABLE | diagnostic | yes (bounded) | Provider outage; one retry permitted. |
+| INTERRUPTED | diagnostic | **TERMINAL — do not retry** | Signal halted the run; stop. |
+## Retry policy
+- Maximum retry depth per error chain: **2**. After two consecutive errors
+  sharing the same code, stop and surface the envelope to the caller.
+- Terminal errors (INTERRUPTED) are **never** retried.
+- CALIBRATION_REJECTED is a validity signal, not a transient error. Treat it
+  as terminal-for-this-holdout even though the error code itself is diagnostic
+  — retrying the same inputs will produce the same rejection.
+- Never mutate `--threshold`, `--min-tasks`, or `--min-repos` on retry.
+  Those values encode the R11 validity contract; changing them is a human
+  decision that must live in configuration, not in retry logic.
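The retry policy for this skill can be expressed as a pure function over the error codes seen so far. This is a sketch of the rules stated above, not codeprobe's implementation; `should_retry` and the grouping of codes into the two sets are assumptions drawn from the error table.

```python
from typing import List

MAX_RETRY_DEPTH = 2
TERMINAL_CODES = {"INTERRUPTED"}
# Codes the table marks as not retryable for calibrate.
NON_RETRYABLE_CODES = {"CALIBRATION_REJECTED", "METADATA_INVALID", "METADATA_MISSING"}


def should_retry(error_codes: List[str]) -> bool:
    """Decide whether to retry, given the codes of consecutive failed attempts (oldest first)."""
    if not error_codes:
        return False
    latest = error_codes[-1]
    if latest in TERMINAL_CODES or latest in NON_RETRYABLE_CODES:
        return False
    # Count how many times the latest code has repeated back-to-back.
    run = 0
    for code in reversed(error_codes):
        if code != latest:
            break
        run += 1
    return run < MAX_RETRY_DEPTH
```

The function takes no thresholds as input, so a retry cannot alter `--threshold`, `--min-tasks`, or `--min-repos`.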

codeprobe-0.7.0/.claude/skills/codeprobe-check-infra/SKILL.md ADDED

@@ -0,0 +1,106 @@
+---
+name: codeprobe-check-infra
+description: Diagnose mined-task infrastructure for drift and offline readiness. Compares metadata.json capability snapshots to live capabilities and runs credential-TTL preflight for airgapped runs. Triggers on check infra, capability drift, preamble drift, offline preflight, credential ttl, airgapped run readiness. Use this before running mined tasks that were produced on a different machine or weeks ago.
+user-invocable: false
+---
+# codeprobe check-infra (autonomous agent contract)
+Pre-run diagnostics for mined task directories and airgapped environments.
+The command splits into two primary subcommands: `drift` (capability snapshot vs live) and
+`offline` (credential TTL vs expected run duration).
+## Environment (pre-loaded)
+- !`codeprobe doctor --json`
+- !`codeprobe check-infra offline --json`
+`doctor` gives the overall readiness state; `check-infra offline --json`
+pre-warms the credential-TTL surface so the agent can decide up front whether
+an offline run is viable. If the offline envelope reports `status == "error"`
+with `OFFLINE_PREFLIGHT_FAILED`, do NOT attempt an offline run until the failure is resolved.
+## Bare invocation
+Capability drift against a specific task directory:
+```bash
+codeprobe check-infra drift <task_dir> --json
+```
+Tolerate drift (emit warning instead of failing):
+```bash
+codeprobe check-infra drift <task_dir> --json --allow-capability-drift
+```
+Offline credential preflight for an anticipated 2-hour run:
+```bash
+codeprobe check-infra offline --json --expected-run-duration 2h
+```
+Restrict the offline check to a single backend:
+```bash
+codeprobe check-infra offline --json --backend claude
+```
+## JSON fields to parse
+Drift:
+```json
+{
+  "status": "ok" | "error",
+  "command": "check-infra drift",
+  "exit_code": 0,
+  "data": {
+    "task_dir": "<abs-path>",
+    "drift_detected": <bool>,
+    "snapshot_capabilities": [ "..." ],
+    "live_capabilities": [ "..." ],
+    "added": [ "..." ],
+    "removed": [ "..." ]
+  },
+  "errors": [ { "code": "<CODE>", "message": "...", "remediation": "...", "terminal": <bool> } ]
+}
+```
+Offline:
+```json
+{
+  "status": "ok" | "error",
+  "command": "check-infra offline",
+  "exit_code": 0,
+  "data": {
+    "expected_run_duration_seconds": <int>,
+    "backends": [ { "name": "...", "ttl_seconds": <int | null>, "ok": <bool> } ]
+  },
+  "errors": [ { "code": "<CODE>", "message": "...", "remediation": "...", "terminal": <bool> } ]
+}
+```
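An agent reading the offline envelope could decide viability as in this sketch. It trusts the per-backend `ok` flag as reported and does not reinterpret `ttl_seconds`; the function names are hypothetical and the sample backend names in any caller are illustrative only.

```python
from typing import Any, Dict, List


def failing_backends(envelope: Dict[str, Any]) -> List[str]:
    """Return the names of backends whose offline preflight reported ok == false."""
    backends = (envelope.get("data") or {}).get("backends", [])
    return [b["name"] for b in backends if not b.get("ok", False)]


def offline_run_viable(envelope: Dict[str, Any]) -> bool:
    """An offline run is treated as viable only if status is ok and no backend failed."""
    return envelope.get("status") == "ok" and not failing_backends(envelope)
```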
+## Error handling
+Only the codes below may surface. Cross-reference `src/codeprobe/cli/error_codes.json`.
+| Code | Kind | Retryable? | Action |
+|---|---|---|---|
+| CAPABILITY_DRIFT | diagnostic | no | Run `codeprobe doctor --capabilities`; re-mine or re-baseline if intentional. |
+| METADATA_MISSING | diagnostic | no | Target task_dir has no metadata.json; stop. |
+| OFFLINE_PREFLIGHT_FAILED | diagnostic | no | At least one backend's credential TTL is too short; rotate/refresh credentials. |
+| OFFLINE_NET_ATTEMPT | diagnostic | no | Component attempted network IO while offline; fix config. |
+| STALE_USER_HOME_SKILL | diagnostic | yes (with fix) | Re-install the referenced skill bundle per remediation. |
+| DOCTOR_CHECKS_FAILED | diagnostic | no | Cross-surfaced from doctor; resolve those checks first. |
+| INTERRUPTED | diagnostic | **TERMINAL — do not retry** | Signal halted the command; stop. |
+## Retry policy
+- Maximum retry depth per error chain: **2**. After two consecutive errors
+  sharing the same code, stop and surface the envelope to the caller.
+- Terminal errors (INTERRUPTED) are **never** retried.
+- Drift errors almost always need a human decision (re-mine vs accept-drift).
+  Do not auto-retry with `--allow-capability-drift` unless the caller asked
+  for it — that flag changes semantics, not transient state.

codeprobe-0.7.0/.claude/skills/codeprobe-interpret/SKILL.md ADDED

@@ -0,0 +1,80 @@
+---
+name: codeprobe-interpret
+description: Analyze eval results from codeprobe runs. Compares configurations statistically, ranks by score and cost-efficiency, and produces actionable recommendations in JSON or pretty text. Triggers on interpret results, analyze eval results, compare configurations, rank agents, score regression, plot regression. Use this when the agent needs to turn a `codeprobe run` output directory into structured analysis.
+user-invocable: false
+---
+# codeprobe interpret (autonomous agent contract)
+Turn a results directory (or mined-tasks directory in `--regression` mode) into
+a structured analysis envelope. Reporting-only: no side effects on the target
+data.
+## Environment (pre-loaded)
+- !`codeprobe doctor --json`
+`doctor` is the single source of truth for environment readiness. Interpret is
+read-only, so most doctor failures (missing backends, credentials) do NOT block
+this command. Still, if doctor reports a corrupt `.codeprobe` state, resolve it
+before interpreting.
+## Bare invocation
+```bash
+codeprobe interpret <results_path> --json
+```
+Regression mode (per-task score over commit history from `codeprobe mine --refresh`):
+```bash
+codeprobe interpret <tasks_path> --json --regression --results <results_path>
+```
+Alternative serialization via `--format` (applies only when `--json` is not set):
+```bash
+codeprobe interpret <results_path> --format csv
+```
+## JSON fields to parse
+```json
+{
+  "status": "ok" | "error",
+  "command": "interpret",
+  "exit_code": 0,
+  "data": {
+    "configs": [
+      { "id": "...", "score_mean": <float>, "cost_mean_usd": <float>, "rank": <int> }
+    ],
+    "recommendations": [ { "text": "...", "confidence": <float> } ],
+    "regression": { "task_id": "...", "series": [ { "sha": "...", "score": <float> } ] }
+  },
+  "errors": [ { "code": "<CODE>", "message": "...", "remediation": "...", "terminal": <bool> } ]
+}
+```
+`data.regression` is only present when `--regression` is passed. `data.configs`
+is always a sorted list; `rank == 1` is the top config.
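Reading the ranked configs out of the envelope might look like the sketch below. It uses only the fields shown in the schema above; `top_config` and `score_per_dollar` are hypothetical helpers, and score per dollar is an illustrative metric, not necessarily how codeprobe computes cost-efficiency.

```python
from typing import Any, Dict, Optional


def top_config(envelope: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the config with rank == 1, or None if the list is empty."""
    configs = (envelope.get("data") or {}).get("configs", [])
    return next((c for c in configs if c.get("rank") == 1), None)


def score_per_dollar(config: Dict[str, Any]) -> Optional[float]:
    """Mean score divided by mean cost in USD; None when the cost is zero or missing."""
    cost = config.get("cost_mean_usd") or 0.0
    if cost <= 0:
        return None
    return config["score_mean"] / cost
```

Looking up `rank == 1` explicitly avoids depending on the order of the list.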
+## Error handling
+Interpret is reporting-only, so the error surface is small. Only the codes
+below may surface. Cross-reference `src/codeprobe/cli/error_codes.json`.
+| Code | Kind | Retryable? | Action |
+|---|---|---|---|
+| NO_TASKS | diagnostic | no | Target results dir has no tasks; check the path. |
+| METADATA_MISSING | diagnostic | no | Structural integrity problem; stop and surface. |
+| METADATA_INVALID | diagnostic | no | Structural integrity problem; run `codeprobe validate --strict` first. |
+| INTERRUPTED | diagnostic | **TERMINAL — do not retry** | Signal halted the run; stop. |
+## Retry policy
+- Maximum retry depth per error chain: **2**. After two consecutive errors
+  sharing the same code, stop and surface the envelope to the caller.
+- Terminal errors (INTERRUPTED) are **never** retried.
+- Because interpret is read-only, "retry" almost always means the upstream data
+  is wrong. Fix the data (re-run `codeprobe run` or `codeprobe validate`)
+  rather than loop on the same inputs.
