@delegance/claude-autopilot 7.9.0 → 7.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -2,6 +2,34 @@
 
 - v5.6 Phase 7 (docs reconciliation) — pending.
 
+ ## 7.10.0 — 2026-05-13
+
+ ### Added
+ - **Retry-loop sameness detector** (`src/core/run-state/sameness-detector.ts`) — new pure-TS module exporting `computeFingerprint`, `isSameFailure`, `shouldEscalate`, and `stripVolatileTokens`. A failure fingerprint is `{ phase, errorType, errorLocation, errorMessage, hash }` where `hash` is `sha256(JSON.stringify([phase, errorType, errorLocation, normalize(message[:200])]))`. JSON encoding is delimiter-safe — pipes, quotes, and embedded JSON in fields cannot produce cross-tuple collisions. `shouldEscalate(history)` returns `{ escalate: true }` when the last two entries have identical hashes — the signal that retries are making no progress.
+ - **Volatile-token stripping built into the detector.** Both `errorLocation` and `errorMessage` are scrubbed of UUIDs, ISO timestamps, 13-digit epoch ms, sha1/sha256 hex digests, /tmp + /var/folders paths, and localhost:port before hashing — so a retry whose message differs only in a per-run UUID or temp directory still hashes to the same fingerprint. Exported as `stripVolatileTokens()` for callers that want to scrub before constructing a fingerprint.
+ - **Public package subpath export.** `package.json` adds `./run-state/sameness-detector` pointing at `dist/src/core/run-state/sameness-detector.{js,d.ts}` so consumers can `import { computeFingerprint, shouldEscalate } from '@delegance/claude-autopilot/run-state/sameness-detector'` without deep-importing into compiled paths.
+ - **Pipeline halts when retries make no progress, even if you have retries remaining.** `skills/autopilot/SKILL.md` Step 4 (validate), Step 7 (Codex PR review), and Step 8 (bugbot) now consult the detector before consuming a retry. If the same failure fingerprint fires twice in a row inside any retry loop, the pipeline stops and surfaces the matching fingerprint to the user instead of burning the remaining retry budget. This catches the class of bug where validate retries fix nothing because the underlying type error is unreachable from the change set.
+ - Tests: `tests/run-state/sameness-detector.test.ts` (32 cases) covers the three issue-#181 acceptance scenarios (same × 2 escalates, same × 1 continues, different × 3 continues), edge cases (empty history, ABA pattern, message truncation, all three phases), delimiter-safety (fields containing `|`), and volatile-token scrubbing (UUIDs, timestamps, tmpdirs). `tests/run-state/sameness-detector-integration.test.ts` (7 cases) verifies SKILL.md retry-block references and that the compiled subpath export is importable + functional.
+
+ ### Notes
+ - Persistence is intentionally in-memory only in v7.10.0. Per-retry-loop history is held in the autopilot skill execution scope; bugbot and validate do not share a history. The v6 run-state events.ndjson integration is tracked separately as issue #180.
+ - Released as v7.10.0 even though issue #181 was originally labeled v7.11.0 — this ships before #178 and #179, so it gets the next minor.
+
+ ### Out of scope (still pending)
+ - Expand/contract migration classification (additive vs destructive enforcement) — v7.11.0 candidate
+ - v6 run-state engine integration into the autopilot skill (4,873 LOC of checkpoint/resume infra currently unused by the skill) — issue #180
+
+ ## 7.9.1 — 2026-05-13 (correctness hotfix)
+
+ ### Fixed
+ - **`skills/autopilot/SKILL.md` ran migrate BEFORE validate.** On stacks that auto-promote (Supabase-script-specific), this could leave production with new schema and no working code if validate or PR review later failed. Resequenced: validate is now Step 4, migrate-dev is Step 5. PR + Codex + bugbot follow. Production migration is explicitly handed off to the user's CI/CD pipeline.
+ - **Removed misleading "dev → QA → prod auto-promote" claim.** That behavior is Supabase-stack-specific, not a generic CLI capability. The skill now references the four real `migrate.policy` keys (`allow_prod_in_ci`, `require_clean_git`, `require_manual_approval`, `require_dry_run_first`) and explains how to wire them in `.autopilot/stack.md`.
+
+ ### Out of scope (filed as v7.10.0 + v7.11.0 candidates)
+ - Expand/contract migration classification (additive vs destructive enforcement)
+ - v6 run-state engine integration into the autopilot skill (4,873 LOC of checkpoint/resume infra currently unused by the skill)
+ - Retry-loop sameness detector ("same fingerprint twice → escalate to human")
+
 ## 7.9.0 — 2026-05-12
 
 ### Changed
package/dist/src/core/run-state/sameness-detector.d.ts ADDED
@@ -0,0 +1,82 @@
+ /** Which loop in the autopilot pipeline produced the failure. The three
+  * Step-4 / Step-7 / Step-8 retry loops in `skills/autopilot/SKILL.md` are
+  * the call sites that consult the detector before consuming a retry. */
+ export type FailurePhase = 'validate' | 'codex-review' | 'bugbot';
+ /** A normalized identity for a single failure occurrence. Two fingerprints
+  * are "the same failure" iff their `hash` is equal — phase, errorType,
+  * errorLocation, and the truncated/normalized errorMessage all feed into
+  * the hash, so any meaningful change between retries produces a new hash. */
+ export interface FailureFingerprint {
+     phase: FailurePhase;
+     /** Discriminator inside the phase. Examples:
+      * - validate: 'tsc_error' | 'test_failure' | 'lint_error'
+      * - codex-review: 'codex_critical' | 'codex_warning'
+      * - bugbot: 'bugbot_high' | 'bugbot_medium' */
+     errorType: string;
+     /** Where the failure points. `file:line` for tsc/lint, test name for tests,
+      * finding-id for codex, comment-id for bugbot. Whatever uniquely locates
+      * the problem within the phase. */
+     errorLocation: string;
+     /** First 200 chars of the canonical message, whitespace-collapsed. The
+      * truncation is what makes the fingerprint stable across runs that differ
+      * only in trailing stack-frame noise. */
+     errorMessage: string;
+     /** sha256 hex of `JSON.stringify([phase, errorType, errorLocation, errorMessage])`.
+      * This is the equality key. */
+     hash: string;
+ }
+ /** Maximum length of the normalized error message that feeds the hash.
+  * Anything beyond this is dropped — picked to match the issue spec and to
+  * keep the hash stable across runs that differ only in trailing noise. */
+ export declare const FINGERPRINT_MESSAGE_MAX = 200;
+ export interface ComputeFingerprintInput {
+     phase: FailurePhase;
+     errorType: string;
+     errorLocation: string;
+     errorMessage: string;
+ }
+ /** Strip known volatile / per-run tokens from a free-form string so that two
+  * retries that differ only in transient data (UUIDs, ports, epoch
+  * timestamps, ISO timestamps, hex SHAs, absolute temp paths) produce the
+  * same canonical form. Order matters — broader patterns run first so they
+  * can swallow embedded delimiters before narrower patterns see them.
+  *
+  * Exported because callers building locations/messages outside this module
+  * may want to apply the same scrubbing before constructing a fingerprint
+  * (e.g. when assembling an `errorLocation` from a tool output that embeds
+  * a run-id). */
+ export declare function stripVolatileTokens(s: string): string;
+ /** Compute a stable fingerprint for a single failure occurrence. The
+  * returned `hash` is the equality key — two failures with equal hashes
+  * are considered "the same failure" for retry-loop escalation purposes. */
+ export declare function computeFingerprint(input: ComputeFingerprintInput): FailureFingerprint;
+ /** Compare two fingerprints. They are "the same failure" iff their hashes
+  * match — phase/type/location/message all feed the hash, so equal hash
+  * means equal across all observable identity. */
+ export declare function isSameFailure(a: FailureFingerprint, b: FailureFingerprint): boolean;
+ export interface EscalationDecision {
+     /** True iff the caller should STOP consuming retries and surface to a
+      * human, because the last two recorded attempts produced the same
+      * failure (no progress between retries). */
+     escalate: boolean;
+     /** Set when `escalate` is true — human-readable explanation. */
+     reason?: string;
+     /** Set when `escalate` is true — the offending fingerprint that fired
+      * twice. Callers should display this to the operator so they can see
+      * what's stuck. */
+     fingerprint?: FailureFingerprint;
+ }
+ /** Decide whether to escalate a retry loop to a human, given the history
+  * of failure fingerprints recorded so far in this retry loop.
+  *
+  * Rule (per issue #181): escalate iff `history.length >= 2` AND the last
+  * two fingerprints have identical hashes. Anything else — first failure,
+  * different failures across retries, longer streak of different failures
+  * — returns `{ escalate: false }`.
+  *
+  * Rationale: a single retry on the SAME failure means we tried, fixed
+  * nothing, and failed identically. Retries that keep failing on
+  * *different* things are still making progress (each one is a new fix).
+  * Only no-progress retries should consume the escalation budget. */
+ export declare function shouldEscalate(history: readonly FailureFingerprint[]): EscalationDecision;
+ //# sourceMappingURL=sameness-detector.d.ts.map
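The declared contract above is small enough to exercise end to end. A minimal sketch, assuming only Node's built-in `crypto`: it mirrors the declared API shapes rather than importing the shipped module, so the function bodies here are illustrative, not the published implementation:

```typescript
import { createHash } from 'node:crypto';

type FailurePhase = 'validate' | 'codex-review' | 'bugbot';

interface FailureFingerprint {
    phase: FailurePhase;
    errorType: string;
    errorLocation: string;
    errorMessage: string;
    hash: string;
}

// Sketch of computeFingerprint: collapse whitespace, truncate to 200 chars,
// then hash the JSON 4-tuple (delimiter-safe, per the declared contract).
function computeFingerprint(input: Omit<FailureFingerprint, 'hash'>): FailureFingerprint {
    const errorMessage = input.errorMessage.replace(/\s+/g, ' ').trim().slice(0, 200);
    const canonical = JSON.stringify([input.phase, input.errorType, input.errorLocation, errorMessage]);
    const hash = createHash('sha256').update(canonical).digest('hex');
    return { ...input, errorMessage, hash };
}

// Sketch of shouldEscalate: escalate iff the last two attempts hashed identically.
function shouldEscalate(history: readonly FailureFingerprint[]): { escalate: boolean } {
    if (history.length < 2) return { escalate: false };
    return { escalate: history[history.length - 1].hash === history[history.length - 2].hash };
}

const tscFail = computeFingerprint({
    phase: 'validate',
    errorType: 'tsc_error',
    errorLocation: 'src/app.ts:42',
    errorMessage: "TS2322: Type 'string' is not assignable to type 'number'.",
});

console.log(shouldEscalate([tscFail]).escalate);          // first failure: keep retrying
console.log(shouldEscalate([tscFail, tscFail]).escalate); // same failure twice: halt
```

Note the asymmetry the rule buys: equality is hash equality, and only the last two history entries are inspected, so an A-B-A sequence keeps retrying (each attempt changed something) while A-A halts immediately.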
package/dist/src/core/run-state/sameness-detector.js ADDED
@@ -0,0 +1,146 @@
+ // src/core/run-state/sameness-detector.ts
+ //
+ // Retry-loop sameness detector — escalates when the same failure fingerprint
+ // fires twice in a row during a retry loop (validate, codex PR review, or
+ // bugbot). The pipeline halts when retries make no progress, even if you have
+ // retries remaining.
+ //
+ // Issue: #181 (v7.11.0 — released as v7.10.0).
+ //
+ // Design:
+ // - `FailureFingerprint` is a hashable identity for a failure. Same hash
+ //   across two attempts means "we tried, failed for the same reason, fixed
+ //   nothing". That is the signal to stop burning retries and surface to a
+ //   human.
+ // - Storage is in-memory only. The v6 run-state events.ndjson integration
+ //   is tracked separately as issue #180; explicitly deferred here so the
+ //   pipeline can adopt the detector without waiting on persistence.
+ // - All functions are pure (modulo `crypto.createHash`), making this easy
+ //   to unit-test under node:test.
+ //
+ // Who calls this:
+ //   The detector is consumed by the autopilot skill agent (an LLM following
+ //   `skills/autopilot/SKILL.md`), NOT by the `scripts/validate.ts`,
+ //   `scripts/codex-pr-review.ts`, or `scripts/bugbot.ts` CLI scripts. Those
+ //   scripts are stateless per-invocation; the retry loop lives one layer
+ //   above them, inside the skill execution. Wiring this into the CLIs would
+ //   not catch repeated failures because each CLI invocation is a clean
+ //   process. The skill agent is the durable retry-loop scope.
+ import { createHash } from 'node:crypto';
+ /** Maximum length of the normalized error message that feeds the hash.
+  * Anything beyond this is dropped — picked to match the issue spec and to
+  * keep the hash stable across runs that differ only in trailing noise. */
+ export const FINGERPRINT_MESSAGE_MAX = 200;
+ /** Strip known volatile / per-run tokens from a free-form string so that two
+  * retries that differ only in transient data (UUIDs, ports, epoch
+  * timestamps, ISO timestamps, hex SHAs, absolute temp paths) produce the
+  * same canonical form. Order matters — broader patterns run first so they
+  * can swallow embedded delimiters before narrower patterns see them.
+  *
+  * Exported because callers building locations/messages outside this module
+  * may want to apply the same scrubbing before constructing a fingerprint
+  * (e.g. when assembling an `errorLocation` from a tool output that embeds
+  * a run-id). */
+ export function stripVolatileTokens(s) {
+     if (typeof s !== 'string')
+         return '';
+     return (s
+         // ISO-8601 timestamps (e.g. 2026-05-13T07:00:00.000Z)
+         .replace(/\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,6})?(?:Z|[+-]\d{2}:?\d{2})?\b/g, '<ts>')
+         // 13-digit epoch ms
+         .replace(/\b\d{13}\b/g, '<ts>')
+         // UUIDs (v1-v5)
+         .replace(/\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b/g, '<uuid>')
+         // 40-char (sha1) or 64-char (sha256) hex digests
+         .replace(/\b[0-9a-fA-F]{40}\b/g, '<sha>')
+         .replace(/\b[0-9a-fA-F]{64}\b/g, '<sha>')
+         // macOS / Linux temp paths (/tmp, /var/folders) up to the next whitespace
+         .replace(/\/(?:tmp|var\/folders)\/[^\s'"`]+/g, '<tmpdir>')
+         // localhost ports like :49213
+         .replace(/\b(?:127\.0\.0\.1|localhost):\d{2,5}\b/g, '<host:port>'));
+ }
+ /** Normalize a free-form error message: strip volatile tokens, trim, collapse
+  * all runs of whitespace (including newlines/tabs) to single spaces, and
+  * truncate to `FINGERPRINT_MESSAGE_MAX` characters. The truncation is what
+  * makes the fingerprint stable across runs whose messages differ only in
+  * trailing stack-frame noise. */
+ function normalizeMessage(msg) {
+     if (typeof msg !== 'string') {
+         return '';
+     }
+     const scrubbed = stripVolatileTokens(msg);
+     const collapsed = scrubbed.replace(/\s+/g, ' ').trim();
+     if (collapsed.length <= FINGERPRINT_MESSAGE_MAX) {
+         return collapsed;
+     }
+     return collapsed.slice(0, FINGERPRINT_MESSAGE_MAX);
+ }
+ /** Normalize an `errorLocation` (file path / test name / etc.) by applying
+  * the same volatile-token scrubbing as the message, plus whitespace trim.
+  * Does NOT truncate — locations are short by construction. */
+ function normalizeLocation(loc) {
+     if (typeof loc !== 'string')
+         return '';
+     return stripVolatileTokens(loc).trim();
+ }
+ /** Compute a stable fingerprint for a single failure occurrence. The
+  * returned `hash` is the equality key — two failures with equal hashes
+  * are considered "the same failure" for retry-loop escalation purposes. */
+ export function computeFingerprint(input) {
+     const phase = input.phase;
+     const errorType = (input.errorType ?? '').toString();
+     const errorLocation = normalizeLocation((input.errorLocation ?? '').toString());
+     const errorMessage = normalizeMessage(input.errorMessage ?? '');
+     // Use JSON.stringify of a 4-tuple as the canonical pre-hash serialization.
+     // This is unambiguous under any field content — pipe characters, quotes,
+     // braces, embedded JSON, etc. all serialize unambiguously and cannot
+     // produce collisions across different `[phase, type, location, message]`
+     // tuples. (Earlier drafts used a pipe-delimited string; that was vulnerable
+     // to delimiter ambiguity when, e.g., a test name legitimately contained
+     // '|'. The JSON form has no such edge case.)
+     const canonical = JSON.stringify([phase, errorType, errorLocation, errorMessage]);
+     const hash = createHash('sha256').update(canonical).digest('hex');
+     return { phase, errorType, errorLocation, errorMessage, hash };
+ }
+ /** Compare two fingerprints. They are "the same failure" iff their hashes
+  * match — phase/type/location/message all feed the hash, so equal hash
+  * means equal across all observable identity. */
+ export function isSameFailure(a, b) {
+     if (!a || !b)
+         return false;
+     return a.hash === b.hash;
+ }
+ /** Decide whether to escalate a retry loop to a human, given the history
+  * of failure fingerprints recorded so far in this retry loop.
+  *
+  * Rule (per issue #181): escalate iff `history.length >= 2` AND the last
+  * two fingerprints have identical hashes. Anything else — first failure,
+  * different failures across retries, longer streak of different failures
+  * — returns `{ escalate: false }`.
+  *
+  * Rationale: a single retry on the SAME failure means we tried, fixed
+  * nothing, and failed identically. Retries that keep failing on
+  * *different* things are still making progress (each one is a new fix).
+  * Only no-progress retries should consume the escalation budget. */
+ export function shouldEscalate(history) {
+     if (!Array.isArray(history) || history.length < 2) {
+         return { escalate: false };
+     }
+     const last = history[history.length - 1];
+     const prev = history[history.length - 2];
+     if (!last || !prev) {
+         return { escalate: false };
+     }
+     if (isSameFailure(prev, last)) {
+         return {
+             escalate: true,
+             reason: `Retry loop produced the same failure twice in a row ` +
+                 `(phase=${last.phase}, errorType=${last.errorType}, ` +
+                 `errorLocation=${last.errorLocation}). The pipeline is not making ` +
+                 `progress — surfacing to human instead of consuming another retry.`,
+             fingerprint: last,
+         };
+     }
+     return { escalate: false };
+ }
+ //# sourceMappingURL=sameness-detector.js.map
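To see the scrubbing in action on realistic noise, here is a small standalone sketch. The regexes are copied verbatim from `stripVolatileTokens` above; the two error strings are invented examples:

```typescript
// Regexes copied from stripVolatileTokens above; standalone so it runs without the package.
function stripVolatileTokens(s: string): string {
    return s
        .replace(/\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,6})?(?:Z|[+-]\d{2}:?\d{2})?\b/g, '<ts>')
        .replace(/\b\d{13}\b/g, '<ts>')
        .replace(/\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b/g, '<uuid>')
        .replace(/\b[0-9a-fA-F]{40}\b/g, '<sha>')
        .replace(/\b[0-9a-fA-F]{64}\b/g, '<sha>')
        .replace(/\/(?:tmp|var\/folders)\/[^\s'"`]+/g, '<tmpdir>')
        .replace(/\b(?:127\.0\.0\.1|localhost):\d{2,5}\b/g, '<host:port>');
}

// Two retries that differ only in per-run noise: port, UUID, temp dir, timestamp.
const run1 = stripVolatileTokens(
    'ECONNREFUSED localhost:49213 run 3f2a9c1e-8b4d-4f6a-9c2e-1a2b3c4d5e6f wrote /tmp/build-xyz/out.log at 2026-05-13T07:00:00.000Z',
);
const run2 = stripVolatileTokens(
    'ECONNREFUSED localhost:50888 run 0e1d2c3b-4a59-4877-bc10-ffeeddccbbaa wrote /tmp/build-abc/out.log at 2026-05-13T09:30:12.456Z',
);
console.log(run1 === run2); // true: both canonicalize to the same string
```

Both inputs reduce to `ECONNREFUSED <host:port> run <uuid> wrote <tmpdir> at <ts>`, so they produce identical fingerprints and the second occurrence triggers escalation.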
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@delegance/claude-autopilot",
-  "version": "7.9.0",
+  "version": "7.10.0",
   "type": "module",
   "publishConfig": {
     "tag": "next"
@@ -39,6 +39,10 @@
     "types": "./dist/src/index.d.ts",
     "default": "./dist/src/index.js"
   },
+  "./run-state/sameness-detector": {
+    "types": "./dist/src/core/run-state/sameness-detector.d.ts",
+    "default": "./dist/src/core/run-state/sameness-detector.js"
+  },
   "./bin/claude-autopilot.js": "./bin/claude-autopilot.js",
   "./bin/guardrail.js": "./bin/guardrail.js",
   "./package.json": "./package.json"
package/skills/autopilot/SKILL.md CHANGED
@@ -1,6 +1,6 @@
 ---
 name: autopilot
- description: End-to-end pipeline — brainstorm → spec → plan → implement → migrate → validate → PR → Codex review → bugbot. Risk-tiered. No manual intervention after spec approval.
+ description: End-to-end pipeline — brainstorm → spec → plan → implement → validate → migrate dev → PR → Codex review → bugbot → merge. Risk-tiered. No manual intervention after spec approval until PR merge; production deploy/migration gates are handled by the user's CI/CD pipeline.
 ---
 
 # Autopilot — Idea to Merged PR Pipeline
@@ -209,26 +209,16 @@ For each task:
 - Skip formal spec/quality review to maintain speed (the validate step catches issues)
 - If subagent fails to write to worktree: implement directly
 
- ### Step 4: Auto-migrate
+ ### Step 4: Validate
 
- For any `.sql` files created in `data/deltas/` during implementation:
+ Run both checks in order. **Autofix runs first** so the static review sees the final post-autofix diff (otherwise the LLM review is stale on autofix mutations):
 
 ```bash
- /migrate
- ```
-
- Run against dev → QA → prod with auto-promote. If migration fails, fix the SQL and retry.
-
- ### Step 5: Validate
-
- Run both checks in order:
+ # 1. Full project validation FIRST (autofix, tests, codex, gate) — mutates files
+ npx tsx scripts/validate.ts --commit-autofix --allow-dirty
 
- ```bash
- # 1. Static rules + LLM review on changed files
+ # 2. THEN static rules + LLM review on the final diff
 npx autopilot run --base main
-
- # 2. Full project validation (autofix, tests, codex, gate)
- npx tsx scripts/validate.ts --commit-autofix --allow-dirty
 ```
 
 The `validate.ts` Phase 1 includes a **tsc regression check**: it runs `npx tsc --noEmit` against both the PR and the merge-base (cached at `.claude/.tsc-baseline-cache.json`) and surfaces only files where the PR introduces *new* TypeScript errors versus the baseline. Forward-pressure check — type errors are warnings, not blockers.
@@ -236,10 +226,83 @@
 If either FAIL:
 - Read findings / validation report at `.claude/validation-report.json`
 - Fix the blocking issues
+ - **Before consuming a retry, compute the failure fingerprint** with `computeFingerprint({ phase: 'validate', errorType, errorLocation, errorMessage })` from `src/core/run-state/sameness-detector.ts`:
+   - `errorType`: `tsc_error` | `test_failure` | `lint_error` (whichever class caused the FAIL)
+   - `errorLocation`: prefer stable identifiers — `<repo-relative-path>:<line>` for tsc/lint, test-suite-name+test-name for tests. Avoid absolute paths and transient temp dirs; use the repo-relative form so the hash survives across working-tree moves.
+   - `errorMessage`: first 200 chars of the canonical message. Strip any embedded UUIDs, ports, or epoch timestamps from the message before computing — those rotate per run and will mask true sameness.
+
+   **Known limitation:** if the underlying fix shifts line numbers, two retries can produce different fingerprints even when the root cause is identical (false negative — no escalation). The current behavior is conservative: it never falsely escalates, but it may let the loop run its full retry budget on a shifting-line-number failure. If you see this pattern, prefer test name + diagnostic code over `file:line`.
+ - Append the fingerprint to an in-memory list for this retry loop, then call `shouldEscalate(history)`. If `escalate === true` (the last two attempts produced the same fingerprint), STOP — do not consume another retry. Surface the matching fingerprint to the user and stop the pipeline.
 - Re-run the failing check
 - Max 3 retry iterations
 
- If both PASS: proceed to PR.
+ If both PASS: proceed to Step 5.
+
+ ### Step 5: Migrate dev-only (verify SQL parses + applies)
+
+ For any `.sql` files created in `data/deltas/` during implementation, run the migration against the dev database ONLY. Always invoke explicitly — do not rely on the CLI default:
+
+ ```bash
+ /migrate --env=dev
+ ```
+
+ This verifies the SQL parses and applies against the real schema shape before the PR opens. Prod migration is **deferred to the user's CI/CD pipeline** after the PR merges — autopilot does NOT orchestrate prod deploys or migrations directly, because deployment ordering varies wildly across user infrastructure (ECS, CodeBuild, blue/green, manual approvals).
+
+ **Post-migration re-validate** (catches type/test failures that only surface after the schema actually changes):
+
+ ```bash
+ # Regenerate any stack-specific DB types if migration introduced schema changes
+ # (e.g. Supabase: scripts/gen-types.ts). Stack-specific — skip if N/A.
+ # REQUIRED for any stack with generated DB-typed code: missing regeneration leaves
+ # stale types and validate.ts may not catch runtime/schema mismatches if the PR
+ # does not directly touch the changed columns.
+
+ # Re-run validation against the post-migration dev state (no autofix this time):
+ npx tsx scripts/validate.ts --allow-dirty
+
+ # If migration or type generation produced new file changes, re-run the
+ # static/LLM review against the final diff — the Step 4 review is now stale.
+ git diff --quiet HEAD -- || npx autopilot run --base main
+ ```
+
+ > **v7.10+ candidate:** automate detection by reading a `post_migrate_dev: [...]` hook list from `.autopilot/stack.md` and failing Step 5 if schema changes are detected but no type-generation hook is configured. Today this is advisory — operator responsibility.
+
+ **Dev database drift** — `/migrate --env=dev` applies schema changes to dev BEFORE the PR is merged. If the PR is later rejected, substantially changed, or abandoned, dev will contain unmerged schema. Mitigations:
+
+ - **Preferred:** target an isolated/ephemeral dev database per branch (per-PR Supabase project, Postgres schema namespace, or local container).
+ - **Shared dev DB fallback:** before running Step 5, capture the current migration_state head; on PR abandonment, run a corrective migration that brings dev back to that head. Document the policy in `.autopilot/stack.md` so the team knows the cleanup contract.
+
+ If migration fails:
+ - **Before retrying:** confirm the dev migration rolled back cleanly. Some migration tools leave partial state on failure (non-transactional DDL, drift in migration_state table). If partial state exists, reset the dev database to the pre-migration baseline OR write a corrective migration that brings dev back to a known state.
+ - Fix the SQL
+ - Re-run Step 4 (validate) against the corrected code
+ - Re-run Step 5 (migrate dev)
+ - Max 3 retry iterations
+
+ **For prod-safety policy**, projects should set in `.autopilot/stack.md`:
+
+ ```yaml
+ migrate:
+   skill: <your-skill>@1
+   policy:
+     require_dry_run_first: true     # always preview before apply
+     require_manual_approval: true   # CI gates prod on human approval
+     require_clean_git: true         # never apply against dirty tree
+ ```
+
+ As of v7.9.1, the valid keys per `presets/schemas/migrate.schema.json` are: `allow_prod_in_ci`, `require_clean_git`, `require_manual_approval`, `require_dry_run_first`. If the schema changes in a future release, that file is the source of truth.
+
+ These keys are **declarative policy inputs**, not autopilot enforcement. Your CI/CD pipeline must explicitly invoke a migrate runner that reads `.autopilot/stack.md` AND enforces equivalent checks (e.g. require-clean-git, dry-run-before-apply, manual-approval-for-prod). Setting them in stack.md alone does not protect production unless your pipeline reads them.
+
+ **Minimum CI/CD migrate-runner contract** (fail closed on all of these):
+
+ 1. Refuse to apply if `.autopilot/stack.md` cannot be read or parsed.
+ 2. Refuse to apply if the policy block contains unknown keys (schema-validate against `presets/schemas/migrate.schema.json`).
+ 3. If `require_dry_run_first: true`: refuse apply without a matching dry-run artifact for the current git head + target env.
+ 4. If `require_manual_approval: true` and `env != dev`: require an explicit human approval signal (CI approval gate, signed commit, etc.) before apply.
+ 5. If `require_clean_git: true`: refuse to apply against a dirty working tree (untracked or unstaged changes). This is intentionally limited to working-tree cleanliness — commit topology (squash vs rebase vs merge vs tag) is a separate concern not enforced by this key.
+ 6. If `allow_prod_in_ci: false` (the default): refuse prod apply from any CI context. **Note:** teams whose intended prod migration path is CI/CD with manual approval must explicitly set `allow_prod_in_ci: true` alongside `require_manual_approval: true` and `require_dry_run_first: true`. `allow_prod_in_ci: false` is for manual/operator-run production migration workflows only.
+ 7. On any policy-read or schema-validation failure, exit non-zero and surface the specific check that failed.
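The seven-point contract above can be sketched as a pure decision function. This is a hypothetical illustration, not the real runner: it assumes the `migrate.policy` block has already been parsed out of `.autopilot/stack.md`, and every name in it is invented for the example:

```typescript
// Hypothetical fail-closed gate sketching the seven-point contract above.
interface RunContext {
    env: 'dev' | 'qa' | 'prod';
    inCi: boolean;
    gitClean: boolean;
    dryRunArtifactPresent: boolean;  // dry-run artifact exists for current head + env
    humanApproved: boolean;          // explicit CI approval-gate signal
}

const KNOWN_KEYS = new Set([
    'allow_prod_in_ci', 'require_clean_git', 'require_manual_approval', 'require_dry_run_first',
]);

// Returns the list of failed checks; an empty list means the apply may proceed.
// A real runner would exit non-zero and print these (check 7).
function gateMigration(policy: Record<string, unknown> | null, ctx: RunContext): string[] {
    if (policy === null) return ['policy unreadable/unparseable: refuse to apply'];  // check 1
    const failed: string[] = [];
    for (const key of Object.keys(policy)) {
        if (!KNOWN_KEYS.has(key)) failed.push(`unknown policy key: ${key}`);         // check 2
    }
    if (policy.require_dry_run_first === true && !ctx.dryRunArtifactPresent) {
        failed.push('require_dry_run_first: no dry-run artifact for this head+env'); // check 3
    }
    if (policy.require_manual_approval === true && ctx.env !== 'dev' && !ctx.humanApproved) {
        failed.push('require_manual_approval: no human approval signal');            // check 4
    }
    if (policy.require_clean_git === true && !ctx.gitClean) {
        failed.push('require_clean_git: dirty working tree');                        // check 5
    }
    if (ctx.env === 'prod' && ctx.inCi && policy.allow_prod_in_ci !== true) {
        failed.push('allow_prod_in_ci is not true: refuse prod apply from CI');      // check 6
    }
    return failed;
}

const policy = { require_dry_run_first: true, require_manual_approval: true, require_clean_git: true };
const prodFromCi: RunContext = { env: 'prod', inCi: true, gitClean: true, dryRunArtifactPresent: true, humanApproved: true };
console.log(gateMigration(policy, prodFromCi)); // blocked: allow_prod_in_ci was never set to true
```

The final call illustrates the check-6 footgun called out above: a team doing prod migration from CI with full approvals still gets refused until it opts in with `allow_prod_in_ci: true`.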
 
 ### Step 6: Push + create PR
 
@@ -257,6 +320,7 @@ npx tsx scripts/codex-pr-review.ts <pr-number>
 Posts Codex review as a GitHub PR comment. **This serves as the second risk-tiered pass for medium-risk specs and the third pass for high-risk specs.** Remediate CRITICAL findings:
 - Fix on the branch
 - Push
+ - **Before consuming a re-review iteration, compute the failure fingerprint** with `computeFingerprint({ phase: 'codex-review', errorType: 'codex_critical', errorLocation, errorMessage })` from `src/core/run-state/sameness-detector.ts`. For `errorLocation` prefer a stable composite — `${normalized-title}::${repo-relative-path}` — over the transport-level finding ID (Codex regenerates finding IDs on each review run, so the raw ID is not stable). For `errorMessage` use the first sentence of the finding body, stripping any per-run identifiers (run-id, timestamps, file SHAs). Append to an in-memory list for this retry loop and call `shouldEscalate(history)`. If `escalate === true` — the same critical finding fired twice after a remediation attempt — STOP and surface to the user. Do not consume another re-review.
 - Re-run Codex review
 - Max 2 iterations
 
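The `${normalized-title}::${repo-relative-path}` composite recommended above could be built like this. The normalization details are an assumption (any deterministic slug works); the stability across review runs is the point:

```typescript
// Illustrative helper for the stable composite location; not part of the package.
// Codex regenerates finding IDs on each review run, so the title+path composite
// is what keeps two runs of the same finding hashing identically.
function codexLocation(findingTitle: string, repoRelPath: string): string {
    const normalizedTitle = findingTitle
        .toLowerCase()
        .trim()
        .replace(/\s+/g, '-')          // spaces to hyphens
        .replace(/[^a-z0-9-]/g, '');   // drop punctuation that may vary between runs
    return `${normalizedTitle}::${repoRelPath}`;
}

console.log(codexLocation('Unchecked null dereference', 'src/core/run.ts'));
// 'unchecked-null-dereference::src/core/run.ts'
```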
@@ -271,6 +335,7 @@ npx tsx scripts/bugbot.ts --pr <pr-number>
 Triages each finding (real bug vs false positive), auto-fixes real bugs, dismisses false positives with GitHub replies. If fixes applied:
 - Push
 - Wait for new bugbot comments (30s)
+ - **Before consuming a bugbot retry, compute the failure fingerprint** with `computeFingerprint({ phase: 'bugbot', errorType: <'bugbot_high' | 'bugbot_medium'>, errorLocation, errorMessage })` from `src/core/run-state/sameness-detector.ts`. For `errorLocation` prefer the stable `<repo-relative-path>:<line>` rather than the GitHub comment ID — Cursor reposts findings with fresh comment IDs after each push, so comment ID is not stable across retries. The comment ID can be surfaced as metadata for human inspection but should not enter the hash. For `errorMessage` use the first 200 chars of the finding body. Append to an in-memory list for this retry loop and call `shouldEscalate(history)`. If `escalate === true` — a "fixed" finding re-fired identically — STOP and surface the fingerprint to the user. Do not consume another round.
 - Re-run /bugbot
 - Max 3 rounds
 
@@ -284,15 +349,44 @@ Tell the user:
 - Bugbot triage summary (fixed / dismissed / needs-human)
 - Any human-required items that couldn't be auto-resolved
 
+ ## Retry-loop sameness detector (Steps 4, 7, 8)
+
+ As of v7.10.0, the three retry loops (Step 4 validate, Step 7 Codex PR review, Step 8 bugbot) MUST consult `src/core/run-state/sameness-detector.ts` before consuming a retry. The pipeline halts when retries make no progress — even if you have retries remaining.
+
+ **The detector is consumed by the autopilot skill agent itself** (the LLM following this skill), not by the underlying `scripts/validate.ts` / `scripts/codex-pr-review.ts` / `scripts/bugbot.ts` CLI scripts. Those scripts are stateless per-invocation; the retry loop with its history lives inside this skill execution scope. That is why the integration is in this document and not in the CLI source.
+
+ ```ts
+ // Public package import (uses the subpath export added in v7.10.0):
+ import {
+   computeFingerprint,
+   shouldEscalate,
+ } from '@delegance/claude-autopilot/run-state/sameness-detector';
+
+ // Or from a local checkout:
+ import { computeFingerprint, shouldEscalate } from './src/core/run-state/sameness-detector.ts';
+ ```
+
+ How to use inside each retry loop:
+
+ 1. On a FAIL, derive `{ phase, errorType, errorLocation, errorMessage }` for the most salient blocker.
+ 2. `const fp = computeFingerprint({...})`.
+ 3. Append `fp` to an in-memory list for THIS retry loop (no cross-loop state — bugbot and validate keep separate histories).
+ 4. `const decision = shouldEscalate(history)` — if `decision.escalate === true`, STOP. Surface `decision.fingerprint` and `decision.reason` to the user. Do not consume another retry attempt even if the per-loop max isn't reached.
+ 5. Otherwise apply the fix, run again.
+
+ Persistence is in-memory only in v7.10.0. The v6 run-state events.ndjson integration is tracked as issue #180.
+
 ## Error Recovery
 
 - **Preflight failure:** Surface the specific check, exit. Do not partially run.
 - **Missing skill/credential:** Exit with install/auth hint.
 - **Subagent failure:** Re-dispatch with more context or implement directly.
- - **Migration failure:** Fix SQL, re-run `/migrate`.
- - **Validate failure:** Fix issues, re-run (max 3 retries).
- - **Codex CRITICAL findings:** REMEDIATE (apply fix), push, re-review (max 2 retries). Do NOT continue past unremediated CRITICALs.
- - **Bugbot findings:** `/bugbot` handles triage + fix automatically (max 3 rounds).
+ - **Migration failure (Step 5 — dev only):** Fix SQL, re-run Step 4 (validate) against corrected code, then re-run Step 5 (migrate dev). Prod stays untouched. Max 3 retries.
+ - **Prod migration:** Out of scope for autopilot. Handed off to user's CI/CD pipeline via `migrate.policy` (require_manual_approval, require_dry_run_first).
+ - **Validate failure:** Fix issues, re-run (max 3 retries, sameness-detector may halt earlier).
+ - **Codex CRITICAL findings:** REMEDIATE (apply fix), push, re-review (max 2 retries, sameness-detector may halt earlier). Do NOT continue past unremediated CRITICALs.
+ - **Bugbot findings:** `/bugbot` handles triage + fix automatically (max 3 rounds, sameness-detector may halt earlier).
+ - **Sameness detector halt:** The same failure fingerprint fired in two consecutive retry attempts — the loop is not making progress. Stop, surface the fingerprint, ask the human to inspect.
 - **External hard-block** (TCC, network, etc.): Stop, report what was completed, surface the blocker.
 
 ## When NOT to use