npm - @harness-engineering/cli - Versions diffs - 1.23.1 → 1.23.2 - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (423)

package/dist/agents/skills/claude-code/harness-observability/SKILL.md CHANGED

@@ -268,6 +268,16 @@ Phase 4: VALIDATE
   Result: WARN -- 3 instrumentation gaps, alerting needs SLO alignment
 ```
+## Rationalizations to Reject
+| Rationalization                                                                        | Reality                                                                                                                                                                                                                                                                                                     |
+| -------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We can see what's happening in CloudWatch logs — we don't need structured logging"    | Unstructured log lines cannot be queried, aggregated, or correlated across services. When an incident spans three services, searching for a request ID across unstructured logs is manual forensics. Structured logging is not a nicety — it is the foundation for incident response.                       |
+| "We'll add alerting once we've seen a few incidents and know what to alert on"         | The first incident is the worst time to define alerting. SLO-based burn rate alerts can be defined from traffic patterns before any incidents occur. Waiting for incidents to define thresholds means every early failure goes undetected.                                                                  |
+| "User ID is a useful label for the latency metric — it helps us debug per-user issues" | User ID as a metric label creates one time series per user, which at 100,000 users means 100,000 label combinations. High-cardinality labels exhaust metric storage, cause query timeouts, and make the entire metrics system unstable. Use logs for per-user debugging; use metrics for aggregate signals. |
+| "The tracing library is initialized, so we have distributed tracing"                   | Initializing the library creates root spans but does not propagate context across HTTP boundaries, instrument database calls, or connect traces to logs. Trace initialization without verified end-to-end propagation produces disconnected, useless traces.                                                |
+| "We have alerts — they're just not linked to runbooks yet"                             | An alert that fires at 3am without a runbook link requires the on-call engineer to start debugging from scratch. The absence of a runbook is not a documentation gap; it is a mean-time-to-recover multiplier.                                                                                              |
 ## Gates
 - **No sensitive data in logs.** If PII, credentials, or tokens are detected in log output, it is a blocking finding. The logging configuration must sanitize or redact sensitive fields before any other improvements are made.
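The structured-logging row above can be illustrated with a minimal sketch. The field names (`service`, `request_id`, `message`) and both helper functions are assumptions made for illustration; they are not taken from the skill or from any logging library.

```python
import json


def log_event(service, request_id, message, **fields):
    """Return one JSON log line; every line carries the same request_id key."""
    record = {"service": service, "request_id": request_id, "message": message, **fields}
    return json.dumps(record, sort_keys=True)


def find_request(lines, request_id):
    """Collect every event for one request by parsing fields, not by text search."""
    events = [json.loads(line) for line in lines]
    return [event for event in events if event.get("request_id") == request_id]


# Hypothetical log lines from three services.
lines = [
    log_event("gateway", "req-1", "received", status=200),
    log_event("orders", "req-1", "order created", latency_ms=42),
    log_event("billing", "req-2", "charge failed", status=502),
]
trace = find_request(lines, "req-1")
services = [event["service"] for event in trace]
```

Because every line is a parseable record, correlating one request across services is a field lookup rather than manual forensics.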

package/dist/agents/skills/claude-code/harness-onboarding/SKILL.md CHANGED

@@ -23,7 +23,7 @@
    - Constraints and forbidden patterns
    - Any special instructions or warnings
-2. **Read `harness.yaml`.** Extract:
+2. **Read `harness.config.json`.** Extract:
    - Project name and stack
    - Adoption level (basic, intermediate, advanced)
    - Layer definitions and their directory mappings
@@ -48,7 +48,7 @@
 2. **Map the architecture.** Walk the directory structure and identify:
    - Top-level organization pattern (monorepo, single package, workspace)
    - Source code location and entry points
-   - Layer boundaries (from `harness.yaml` and actual directory structure)
+   - Layer boundaries (from `harness.config.json` and actual directory structure)
    - Shared utilities or common modules
    - Configuration files and their purposes
@@ -61,7 +61,7 @@
    - Code formatting (detect from config files: `.prettierrc`, `.eslintrc`, `biome.json`)
 4. **Map the constraints.** Identify what is restricted:
-   - Forbidden imports (from `harness.yaml` dependency constraints)
+   - Forbidden imports (from `harness.config.json` dependency constraints)
    - Layer boundary rules (which layers can import from which)
    - Linting rules that encode architectural decisions
    - Any constraints documented in `AGENTS.md` that are not yet automated
@@ -95,8 +95,8 @@ Graph queries produce a complete architecture map in seconds, including transiti
 ### Phase 3: ORIENT — Identify Adoption Level and Maturity
-1. **Confirm the adoption level** matches what `harness.yaml` declares:
-   - Basic: `AGENTS.md` and `harness.yaml` exist but no layers or constraints
+1. **Confirm the adoption level** matches what `harness.config.json` declares:
+   - Basic: `AGENTS.md` and `harness.config.json` exist but no layers or constraints
    - Intermediate: Layers defined, dependency constraints enforced, at least one custom skill
    - Advanced: Personas, state management, learnings, CI integration
@@ -184,21 +184,29 @@ Graph queries produce a complete architecture map in seconds, including transiti
 - **`harness check-deps`** — Run to verify dependency constraints are passing, which confirms layer boundaries are respected.
 - **`harness state show`** — View current state to understand where the last session left off.
 - **`AGENTS.md`** — Primary source of project context and agent instructions.
-- **`harness.yaml`** — Source of structural configuration (layers, constraints, skills).
+- **`harness.config.json`** — Source of structural configuration (layers, constraints, skills).
 - **`.harness/learnings.md`** — Historical context and institutional knowledge.
 ## Success Criteria
-- All four configuration sources were read (`AGENTS.md`, `harness.yaml`, `.harness/learnings.md`, `.harness/state.json`)
+- All four configuration sources were read (`AGENTS.md`, `harness.config.json`, `.harness/learnings.md`, `.harness/state.json`)
 - Technology stack is accurately identified (language, framework, test runner, build tool)
 - Architecture is mapped with correct layer boundaries and dependency directions
 - Conventions are identified from actual code patterns, not assumed
-- Constraints are enumerated from both `harness.yaml` and `AGENTS.md`
+- Constraints are enumerated from both `harness.config.json` and `AGENTS.md`
 - Adoption level is confirmed (not just declared — validated)
 - A structured orientation summary is produced with all sections filled
 - The "Getting Started" section is actionable and tailored to the audience
 - `harness validate` was run and results are reported
+## Rationalizations to Reject
+| Rationalization                                                                                                            | Reality                                                                                                                                                                                          |
+| -------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "I can skip reading .harness/learnings.md since it is just historical notes"                                               | Learnings contain hard-won insights from previous sessions -- decisions made, gotchas discovered, patterns that worked or failed. Skipping them means repeating mistakes already diagnosed.      |
+| "The harness.config.json says intermediate, so I can report that without validation"                                       | Declared adoption level must be confirmed, not assumed. A project that declares intermediate but fails harness validate is not truly intermediate.                                               |
+| "I will map the architecture by reading the directory names since that is faster than checking conventions in actual code" | Conventions must be identified from actual code patterns, not assumed from directory structure. File naming, import style, and error handling can only be verified by reading real source files. |
 ## Examples
 ### Example: Onboarding to an Intermediate TypeScript Project
@@ -211,7 +219,7 @@ Read AGENTS.md:
   - Stack: TypeScript, Express, Vitest, PostgreSQL
   - Conventions: zod validation, repository pattern, kebab-case files
-Read harness.yaml:
+Read harness.config.json:
   - Level: intermediate
   - Layers: presentation (src/routes/), business (src/services/), data (src/repositories/)
   - Constraints: presentation → business OK, business → data OK, data → presentation FORBIDDEN
@@ -258,7 +266,7 @@ Produce orientation with all sections. Getting Started for this context:
 ```
 Read AGENTS.md — exists, minimal content
-Read harness.yaml — level: basic, no layers defined
+Read harness.config.json — level: basic, no layers defined
 No .harness/learnings.md
 No .harness/state.json
 ```
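The layer constraints in the intermediate TypeScript example above (presentation → business OK, business → data OK, data → presentation FORBIDDEN) can be sketched as a lookup. The layer directories are copied from that example; the checker itself is a hypothetical illustration and says nothing about how `harness check-deps` is implemented or how `harness.config.json` encodes these rules.

```python
# Layer directories as listed in the onboarding example.
LAYERS = {
    "presentation": "src/routes/",
    "business": "src/services/",
    "data": "src/repositories/",
}

# Only the pair the example marks as FORBIDDEN.
FORBIDDEN = {("data", "presentation")}


def layer_of(path):
    """Map a file path to its layer name, or None if it is outside every layer."""
    for name, prefix in LAYERS.items():
        if path.startswith(prefix):
            return name
    return None


def is_forbidden(importer, imported):
    """True when a file in one layer imports a file from a forbidden layer."""
    return (layer_of(importer), layer_of(imported)) in FORBIDDEN
```

For instance, `is_forbidden("src/repositories/user.ts", "src/routes/user.ts")` is true, while a route importing a service is allowed. The file names here are made up.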

package/dist/agents/skills/claude-code/harness-parallel-agents/SKILL.md CHANGED

@@ -159,6 +159,15 @@ For each independent task, write a focused agent brief:
 - `harness validate` passes after integration
 - No agent modified files outside its declared scope
+## Rationalizations to Reject
+| Rationalization                                                                              | Why It Is Wrong                                                                                                                                |
+| -------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
+| "These two tasks touch different functions in the same file, so they are independent enough" | If both tasks write to the same file, they are NOT independent. Even different functions in the same file create merge conflicts.              |
+| "I verified independence manually -- no need to run check_task_independence"                 | Manual verification misses transitive dependency overlap. check_task_independence with graph-expanded analysis catches transitive conflicts.   |
+| "There are only 2 independent tasks, but parallelism would save time"                        | NOT when there are fewer than 3 independent tasks. Coordination overhead outweighs parallelism benefit for 2 tasks.                            |
+| "Each agent's tests pass, so integration is fine"                                            | Step 4 requires running the FULL test suite after integration. Parallel changes can cause integration failures that individual test runs miss. |
 ## Examples
 ### Example: Parallel Implementation of Three Independent Services
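Two rules from the table above, "same file means not independent" and "fewer than 3 independent tasks is not worth parallelizing", can be sketched as follows. This is an assumed, simplified check on direct file overlap only. It is not the `check_task_independence` tool and does not perform the graph-expanded transitive analysis the table mentions.

```python
from itertools import combinations

# From the table: coordination overhead outweighs the benefit below 3 tasks.
MIN_PARALLEL_TASKS = 3


def conflicting_pairs(tasks):
    """tasks maps task name -> set of files it writes; return pairs sharing a file."""
    return [
        (a, b)
        for (a, files_a), (b, files_b) in combinations(sorted(tasks.items()), 2)
        if files_a & files_b
    ]


def should_parallelize(tasks):
    """Parallelize only when there are enough tasks and none write the same file."""
    return len(tasks) >= MIN_PARALLEL_TASKS and not conflicting_pairs(tasks)
```

A pair of tasks that both write `src/a.ts` (a made-up path) shows up in `conflicting_pairs`, even if they edit different functions in it.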

package/dist/agents/skills/claude-code/harness-perf/SKILL.md CHANGED

@@ -187,6 +187,17 @@ This phase runs only when `.bench.ts` files exist in the project. If none are fo
 - Gate decision is recorded in state
 - `harness validate` passes after enforcement
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                                               | Why It Is Wrong                                                                                                                                                         |
+| ------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The cyclomatic complexity is 16 but the function is straightforward, so I can override the Tier 1 threshold" | Tier 1 violations are non-negotiable blockers. No merge with Tier 1 performance violations. If a threshold needs adjustment, reconfigure with documented justification. |
+| "The benchmark regression is only 6% and it is probably just noise"                                           | The noise margin (default 3%) is applied before flagging. A 6% regression on a perf-critical path exceeds the Tier 1 threshold even after noise consideration.          |
+| "The working tree has a small uncommitted change but it should not affect benchmark results"                  | No running benchmarks with a dirty working tree. Uncommitted changes invalidate benchmark results.                                                                      |
+| "I will update the baselines to match the new performance numbers rather than fixing the regression"          | Baselines must come from fresh runs against committed code. Silently moving the goalposts defeats the purpose of performance gates.                                     |
 ## Examples
 ### Example: PR with High Complexity Function

package/dist/agents/skills/claude-code/harness-perf-tdd/SKILL.md CHANGED

@@ -235,6 +235,16 @@ harness check-perf — complexity reduced from 12 to 8 (improvement)
 harness perf baselines update — new baseline saved
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                           | Reality                                                                                                                                                                                                                                                   |
+| --------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The correctness test is green, I'll add the benchmark later when we know performance is an issue."       | The benchmark is not optional — it is the mechanism that defines "performance issue." Without a baseline captured at implementation time, you have nothing to compare against when a regression appears months later. Later never comes.                  |
+| "I'll skip the REFACTOR phase since the spec doesn't mention performance requirements."                   | The spec not mentioning a requirement means there is no user-facing SLO, not that performance is irrelevant. The benchmark still captures the baseline that future work must not regress from. Phase 3 is optional; the benchmark file is not.            |
+| "The benchmark results vary too much between runs to be meaningful — I'll just omit it."                  | Variance is a signal, not a reason to skip. High variance means the benchmark needs warmup iterations, more samples, or isolation from I/O. Fix the benchmark, do not delete it. An absent benchmark offers zero protection against regressions.          |
+| "This function is only called during startup, so its performance doesn't matter at runtime."              | Startup performance determines deployment speed, lambda cold-start latency, and test suite duration. "Not in the hot path at runtime" does not mean performance is free to ignore. Measure it so the baseline exists if startup behavior changes.         |
+| "We already have an integration test that covers this — writing a separate benchmark would be redundant." | Integration tests verify correctness under realistic conditions. Benchmarks measure isolated performance with precise input control. An integration test that passes in 2 seconds tells you nothing about whether the function itself takes 1ms or 800ms. |
 ## Gates
 - **No code before test AND benchmark.** Both must exist before implementation begins.
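The advice in the table above to fix a noisy benchmark with warmup iterations and more samples can be sketched with the standard library. The function name, default counts, and returned keys are assumptions for illustration. The skill itself refers to `.bench.ts` files, so this Python version is only an analogy.

```python
import statistics
import time


def bench(fn, warmup=5, samples=30):
    """Time fn after discarding warmup runs; report median and relative spread."""
    for _ in range(warmup):
        fn()  # warmup runs are executed but not recorded
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    # Spread relative to the median; a large value suggests the benchmark needs fixing.
    spread = statistics.pstdev(timings) / median if median else 0.0
    return {"median_s": median, "relative_spread": spread, "samples": len(timings)}


result = bench(lambda: sorted(range(1000, 0, -1)))
```

Reporting the spread alongside the median makes high variance visible, which is the cue to repair the benchmark instead of deleting it.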

package/dist/agents/skills/claude-code/harness-planning/SKILL.md CHANGED

@@ -468,6 +468,16 @@ When `docs/changes/` exists in the project, produce `docs/changes/<feature>/delt
 - When `rigorLevel` is `standard` and task count < 8, the skeleton is skipped
 - The skeleton format is lightweight (~200 tokens): numbered groups with task count and time estimates
+## Rationalizations to Reject
+| Rationalization                                                                                               | Reality                                                                                                                                                                 |
+| ------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The task is conceptually clear so I do not need to include exact code in the plan"                           | Every task must have exact file paths, exact code, and exact commands. If you cannot write the code in the plan, you do not understand the task well enough to plan it. |
+| "This task touches 5 files but it is logically one unit of work, so splitting it would add overhead"          | Tasks touching more than 3 files must be split. The overhead of splitting is far less than the cost of a failed oversized task.                                         |
+| "Tests for this task can be added in a follow-up task since the implementation is straightforward"            | No skipping TDD in tasks. Every code-producing task must start with writing a test. "Add tests later" is explicitly forbidden.                                          |
+| "The spec does not cover this edge case, but I can fill in the gap during planning"                           | When the spec is missing information, do not fill in the gaps yourself. Escalate. Filling gaps silently creates undocumented design decisions that no one reviewed.     |
+| "I discovered we need an additional file during decomposition, but updating the file map is just bookkeeping" | The file map must be complete. Every file that will be created or modified must appear in the file map before task decomposition.                                       |
 ## Examples
 ### Example: Planning a User Notification Feature
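Two of the planning rules above are mechanical enough to sketch as a check: tasks touching more than 3 files must be split, and every touched file must appear in the file map. The data shapes (a set of paths and a dict of task name to file list) are assumptions; the skill does not prescribe this representation.

```python
# From the table: tasks touching more than 3 files must be split.
MAX_FILES_PER_TASK = 3


def plan_problems(file_map, tasks):
    """Return a list of rule violations for a plan; an empty list means none found."""
    problems = []
    for name, files in tasks.items():
        if len(files) > MAX_FILES_PER_TASK:
            problems.append(f"{name}: touches {len(files)} files, split it")
        for path in files:
            if path not in file_map:
                problems.append(f"{name}: {path} missing from file map")
    return problems
```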

package/dist/agents/skills/claude-code/harness-pre-commit-review/SKILL.md CHANGED

@@ -284,6 +284,15 @@ fi
 - [ ] AI review focused on high-signal issues only (no style nits)
 - [ ] Report follows the structured format exactly
+## Rationalizations to Reject
+| Rationalization                                                               | Why It Is Wrong                                                                                                                                               |
+| ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The lint errors are just warnings, so I can proceed to AI review"            | The gate is absolute: any mechanical check failure means STOP. AI review does not run until lint, typecheck, and tests all pass.                              |
+| "This is a docs-only change but let me run AI review anyway for thoroughness" | The fast path is mandatory. If only docs/config files changed, AI review is skipped. Running it anyway wastes tokens.                                         |
+| "The AI found a style issue, so I should block the commit"                    | AI review observations are advisory only. Only mechanical check failures block the commit.                                                                    |
+| "I will skip the security scan since this is an internal endpoint"            | Phase 3 runs the security scanner against all staged source files regardless of exposure. Hardcoded secrets and injection are blocking even in internal code. |
 ## Examples
 ### Example: Clean Commit
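The gating order described in the table above can be sketched as a small decision function: mechanical failures stop everything, docs-only changes skip AI review, and AI findings never block. The suffix list used to classify docs/config files is an assumption, and the Phase 3 security scan is left out of this sketch.

```python
# Assumed set of docs/config suffixes; the skill's actual classification may differ.
DOC_SUFFIXES = (".md", ".json", ".yaml", ".yml")


def review_decision(mechanical_ok, changed_files, ai_findings):
    """Decide whether the commit is blocked and whether AI review runs."""
    if not mechanical_ok:
        # Lint, typecheck, or test failure: stop before AI review.
        return {"blocked": True, "ai_review": False, "advisory": []}
    if all(path.endswith(DOC_SUFFIXES) for path in changed_files):
        # Fast path: docs/config-only changes skip AI review.
        return {"blocked": False, "ai_review": False, "advisory": []}
    # AI observations are reported but never block the commit.
    return {"blocked": False, "ai_review": True, "advisory": list(ai_findings)}
```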

package/dist/agents/skills/claude-code/harness-product-spec/SKILL.md CHANGED

@@ -197,6 +197,16 @@
 - Output format matches existing project conventions when they exist
 - Generated PRD is saved to the correct directory with consistent naming
+## Rationalizations to Reject
+| Rationalization                                                                                    | Why It Is Wrong                                                                                                                                        |
+| -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "The feature request is clear enough -- I can skip the ambiguity check and start writing stories"  | The gate: no generating specs from ambiguous input without clarification. Missing actors or undefined triggers lead to untestable acceptance criteria. |
+| "This acceptance criterion is understood by the team, so it does not need to be formally testable" | No untestable acceptance criteria is a hard gate. Every criterion must be verifiable by an automated test or specific manual procedure.                |
+| "The happy path scenarios are enough -- edge cases are unlikely"                                   | The skill requires at least one unwanted-behavior criterion for every user-facing action. Edge cases are where production bugs live.                   |
+| "The existing PRD is outdated, so I will just replace it with a fresh one"                         | No overwriting existing specs is a gate. Present the diff rather than replacing the file.                                                              |
+| "We can figure out the success metrics later during implementation"                                | Every success metric must be measurable, time-bound, and specific at spec time.                                                                        |
 ## Examples
 ### Example: GitHub Issue to PRD for Team Notifications

package/dist/agents/skills/claude-code/harness-property-test/SKILL.md CHANGED

@@ -266,6 +266,16 @@ def test_sort_handles_floats(xs):
         assert result[i] <= result[i + 1]
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                                | Reality                                                                                                                                                                                                                                                                                                                                |
+| ------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We already have example-based tests that cover the edge cases — property tests would just be redundant."                      | Example-based tests cover the cases the author thought of. Property tests cover the cases they did not. The entire value of generative testing is that it explores regions of the input space that human intuition misses — off-by-one errors, Unicode combining characters, signed integer overflow at boundaries.                    |
+| "The generator keeps producing rejected inputs, so I'll just filter more aggressively to make the test pass faster."           | Heavy `filter` usage is a symptom of a broken generator, not a solution. Each rejected sample wastes an iteration, and `filter` destroys the shrinking chain, leaving you with an unhelpful counterexample when a bug is found. Rewrite the generator using `map` and `flatMap` to construct valid inputs directly.                    |
+| "The counterexample is too strange to be a real-world case — I'll just increase the iteration count so it appears less often." | A shrunk counterexample that triggers a property failure is a real bug by definition. "Unlikely in practice" is not a property of correctness — the question is whether the invariant holds. If the counterexample is a valid input the function might receive, fix the function. If it is not a valid input, constrain the generator. |
+| "This function has too many invariants to specify — I'll just skip property testing and trust the unit tests."                 | Complex functions with many invariants are exactly the functions most in need of property testing. High complexity means a larger bug-hiding surface. Start with the most important invariants (no-crash, round-trip, idempotence) rather than attempting to encode all properties at once.                                            |
+| "Property tests are too slow — they'll block CI for 10 minutes."                                                               | Run 100 iterations on PR, 10,000 iterations nightly. The CI time argument justifies reducing iteration count, never eliminating property tests entirely. A suite that runs 0 property tests found 0 edge cases.                                                                                                                        |
 ## Gates
 - **No property tests without shrinking.** If the framework's automatic shrinking is disabled or the generator uses patterns that break shrinking (excessive `filter`), counterexamples will be unhelpfully large. Fix the generator to support shrinking.
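The table's point about `filter` versus constructing valid inputs directly can be sketched with plain `random`, used here as a stand-in for a property-testing framework. The two generator functions are hypothetical and only show the wasted samples; they do not demonstrate shrinking, which needs a real framework.

```python
import random


def evens_by_filter(rng, n):
    """Rejection sampling: draw any integer and discard the odd ones."""
    values, rejected = [], 0
    while len(values) < n:
        x = rng.randint(0, 1000)
        if x % 2 == 0:
            values.append(x)
        else:
            rejected += 1  # a wasted draw
    return values, rejected


def evens_by_map(rng, n):
    """Construction: map every draw to a valid input, so nothing is rejected."""
    return [rng.randint(0, 500) * 2 for _ in range(n)], 0


filtered, wasted = evens_by_filter(random.Random(0), 100)
mapped, wasted_by_map = evens_by_map(random.Random(0), 100)
```

Both functions yield 100 even numbers in the same range, but only the filtering version spends draws on inputs it then throws away.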

package/dist/agents/skills/claude-code/harness-refactoring/SKILL.md CHANGED

@@ -134,6 +134,15 @@ Skipping this step means subsequent graph queries (impact analysis, dependency h
 - No behavioral changes were introduced (the test suite is the proof)
 - No dead code was left behind (run `harness cleanup` to verify)
+## Rationalizations to Reject
+| Rationalization                                                                                                     | Reality                                                                                                                                      |
+| ------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The tests are mostly passing, so I can start refactoring and fix the remaining failures as I go"                   | All tests must pass BEFORE refactoring starts. If tests are not green before you start, you are not refactoring -- you are debugging.        |
+| "This refactoring changes a small amount of behavior, but it is a clear improvement"                                | Refactoring must not change behavior. The test suite is the proof. If the refactoring requires changing tests, you may be changing behavior. |
+| "I will make several changes at once and run tests at the end since each change is small"                           | Tests must run after EVERY single change. If a test breaks, you must undo the LAST change immediately.                                       |
+| "The refactoring did not produce a measurable improvement, but the code is different so it must be somewhat better" | If the refactoring introduced no measurable improvement, revert the entire sequence. Refactoring for its own sake is churn.                  |
 ## Examples
 ### Example: Moving business logic out of a UI component

package/dist/agents/skills/claude-code/harness-release-readiness/SKILL.md CHANGED

@@ -537,6 +537,17 @@ This framing is informational — it does not block anything. It gives the team
 8. Monorepo support: each package is audited independently with per-package results in the report
 9. `harness validate` passes after the skill's SKILL.md and skill.yaml are written
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                           | Why It Is Wrong                                                                                                            |
+| ----------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
+| "The MAINTAIN phase takes too long, so I will skip dispatching the 4 maintenance agents"  | No skipping the MAINTAIN phase. Maintenance checks catch issues that release-specific checks miss.                         |
+| "This auto-fix is obviously correct, so I can apply it without prompting the user"        | No auto-fix without prompting. Every fix must be presented to the human before being applied.                              |
+| "Most checks pass and only a few warnings remain, so the release is ready"                | A "mostly passing" report is not a passing report. The result is PASS only when zero failures exist across all categories. |
+| "The previous run found these issues and I fixed them, so I can trust the cached results" | Session resumption requires re-running all checks. Code may have changed since the last run.                               |
 ## Examples
 ### Example: First Run on a Monorepo with Gaps

package/dist/agents/skills/claude-code/harness-resilience/SKILL.md CHANGED

@@ -240,6 +240,16 @@ Phase 4: VALIDATE
     Redis fallback serves from LRU when Redis is down
 ```
+## Rationalizations to Reject
+| Rationalization                                                                     | Reality                                                                                                                                                                                                                                                                                                                        |
+| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "That third-party API has 99.99% uptime — we don't need a circuit breaker"          | 99.99% uptime means 52 minutes of downtime per year. That downtime will not occur as one predictable window — it will happen as degraded responses and timeouts during a traffic spike. Without a circuit breaker, every caller blocks for the full timeout duration, exhausting thread pools and cascading across the system. |
+| "We have retry logic, so failures are handled"                                      | Retry logic without a circuit breaker amplifies failures. When the downstream service is degraded, retries multiply the load on an already struggling system. Circuit breakers and retries are complementary controls, not alternatives.                                                                                       |
+| "The fallback adds complexity — we'll add it if the circuit breaker actually opens" | A circuit breaker without a fallback is a different kind of failure mode, not resilience. When the circuit opens, users see an error instead of a degraded-but-functional experience. Fallbacks must be designed and tested before the circuit ever opens in production.                                                       |
+| "Our database connection pool is 100 connections — that's plenty"                   | Connection pool size without query timeouts means slow queries hold connections indefinitely. A single slow query spike can exhaust the pool, causing every subsequent request to wait. Pool sizing and query timeouts are both required.                                                                                      |
+| "The service is internal — it doesn't need rate limiting"                           | Internal services are often called by automated processes, CI pipelines, and batch jobs that can spike traffic in ways user-facing services do not. Missing rate limiting on internal services is a common cause of self-inflicted outages during deployments and data migrations.                                             |
 ## Gates
 - **No retry on non-idempotent operations without idempotency keys.** Retrying a POST or DELETE that lacks an idempotency mechanism can cause data duplication or data loss. This is a blocking finding. The operation must be made idempotent before retry logic is added.
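
The resilience rows above rest on two pieces of reasoning that are easy to check in code: the downtime arithmetic behind "99.99% uptime", and the way a circuit breaker paired with a fallback turns a blocked caller into a degraded response. The following is a minimal sketch only. The `CircuitBreaker` class, its thresholds, and the injectable clock are hypothetical names chosen for illustration and are not part of the harness-resilience skill.

```typescript
// Hypothetical sketch. Class name, thresholds, and the injectable clock are
// illustrative assumptions, not an API from the harness-resilience skill.

/** Downtime budget in minutes per 365-day year for a given uptime percentage. */
function annualDowntimeMinutes(uptimePercent: number): number {
  const minutesPerYear = 365 * 24 * 60; // 525,600
  return minutesPerYear * (1 - uptimePercent / 100);
}

type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker<T> {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  /** Runs primary unless the circuit is open; a fallback answer is always available. */
  call(primary: () => T, fallback: () => T): T {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        return fallback(); // fail fast: do not block on a struggling dependency
      }
      this.state = "half-open"; // cool-down elapsed: allow one trial call
    }
    try {
      const result = primary();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}
```

Passing the clock in as a parameter is a design choice made so that the open and half-open transitions can be exercised in a test before the circuit ever opens in production, which is the point the fallback row makes.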

package/dist/agents/skills/claude-code/harness-roadmap/SKILL.md CHANGED

@@ -42,7 +42,7 @@ If the human has not seen and approved the milestone groupings and feature list,
    - Has spec + plan but no implementation -> `planned`
    - Has spec but no plan -> `backlog`
    - Has plan but no spec -> `planned` (unusual, flag for human review)
-6. Detect project name from `harness.yaml` `project` field, or `package.json` `name` field, or directory name as fallback.
+6. Detect project name from `harness.config.json` `project` field, or `package.json` `name` field, or directory name as fallback.
 Present scan summary:
@@ -457,6 +457,15 @@ Choice?
 19. `--query` filters features by status or milestone and displays results with milestone context
 20. `--query` errors gracefully when no roadmap exists, directing the user to `--create`
+## Rationalizations to Reject
+| Rationalization                                                                                                   | Reality                                                                                                                             |
+| ----------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+| "The feature list looks correct, so I can skip the PROPOSE phase and write the roadmap directly"                  | The Iron Law: never write docs/roadmap.md without the human confirming the proposed structure first.                                |
+| "This sync detected a status change and the inference is clearly correct, so I can apply it without confirmation" | The sync PROPOSE phase requires presenting proposed changes and waiting for human confirmation. The human-always-wins rule applies. |
+| "The existing roadmap is outdated, so I will recreate it with --create to get a fresh start"                      | No overwriting an existing roadmap without explicit user consent. Silent overwrites destroy prior manual edits and status tracking. |
+| "There is no roadmap yet but the user asked me to add a feature, so I will create one as a side effect of --add"  | When the roadmap does not exist, --add must error with a clear message directing the user to --create.                              |
 ## Examples
 ### Example: `--create` -- Bootstrap a Roadmap from Existing Artifacts
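
Step 6 in the first hunk of this file describes a three-level fallback for the project name: the `project` field of `harness.config.json`, then the `name` field of `package.json`, then the directory name. A minimal sketch of that lookup order follows. The `detectProjectName` function, its input shape, the choice to ignore blank strings, and the final `"unknown"` default are all assumptions made for illustration; the diff does not show how the CLI implements this step.

```typescript
// Hypothetical helper. Names, input shape, and the "unknown" default are
// assumptions; only the lookup order comes from the SKILL.md text.

interface NameSources {
  harnessConfig?: { project?: unknown } | null; // parsed harness.config.json, if present
  packageJson?: { name?: unknown } | null;      // parsed package.json, if present
  directoryPath: string;                        // path of the project root
}

function nonEmptyString(value: unknown): string | undefined {
  return typeof value === "string" && value.trim() !== "" ? value.trim() : undefined;
}

function detectProjectName(sources: NameSources): string {
  const fromConfig = nonEmptyString(sources.harnessConfig?.project);
  if (fromConfig !== undefined) return fromConfig;

  const fromPackage = nonEmptyString(sources.packageJson?.name);
  if (fromPackage !== undefined) return fromPackage;

  // Fallback: last path segment, tolerating trailing separators and both / and \.
  const segments = sources.directoryPath.split(/[\\/]+/).filter((s) => s !== "");
  return segments.pop() ?? "unknown";
}
```

The function takes already-parsed objects rather than reading files, so the precedence logic can be checked without touching the file system.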

package/dist/agents/skills/claude-code/harness-roadmap-pilot/SKILL.md CHANGED

@@ -150,6 +150,14 @@ Proceed with Feature A? (y/n/pick another)
 7. Transition routes to brainstorming (no spec) or autopilot (spec exists)
 8. `harness validate` passes after all changes
+## Rationalizations to Reject
+| Rationalization                                                                                                         | Reality                                                                                                                                 |
+| ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
+| "The top-scored candidate is obviously correct, so I can assign it without asking the human"                            | The Iron Law: never assign or transition without the human confirming the recommendation first.                                         |
+| "Affinity data is not available so the scoring is degraded -- I should just pick the first planned item"                | Proceed without affinity scoring by zeroing out the affinity weight. Position and dependents signals still produce meaningful rankings. |
+| "The feature has no spec, but I can skip brainstorming and jump straight to planning since the summary is clear enough" | No spec routes to brainstorming, spec exists routes to autopilot. A one-line roadmap summary is not a spec.                             |
 ## Examples
 ### Example: Pick Next Item from a Multi-Milestone Roadmap
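
Two rows in the roadmap-pilot table describe mechanical behavior: scoring continues with the affinity weight set to zero when affinity data is missing, and the next skill is chosen purely by whether a spec exists. The sketch below illustrates both under stated assumptions. The field names, the 0 to 1 signal scale, and the weights are hypothetical, since the actual scoring formula is not part of this diff. The output is a recommendation only, because the table also requires human confirmation before any assignment.

```typescript
// Hypothetical sketch. Field names, the 0..1 signal scale, and the weights
// are assumptions for illustration; the skill's real scoring is not shown here.

interface Candidate {
  name: string;
  position: number;   // 0..1, higher means earlier in the roadmap
  dependents: number; // 0..1, higher means more features blocked on this one
  affinity?: number;  // 0..1, absent when affinity data is unavailable
  hasSpec: boolean;
}

interface Weights {
  position: number;
  dependents: number;
  affinity: number;
}

function score(c: Candidate, w: Weights, affinityAvailable: boolean): number {
  // Zero out the affinity weight instead of abandoning scoring altogether.
  const affinityWeight = affinityAvailable ? w.affinity : 0;
  return w.position * c.position + w.dependents * c.dependents + affinityWeight * (c.affinity ?? 0);
}

function rank(candidates: Candidate[], w: Weights, affinityAvailable: boolean): Candidate[] {
  // Sort a copy, highest score first.
  return [...candidates].sort(
    (a, b) => score(b, w, affinityAvailable) - score(a, w, affinityAvailable),
  );
}

function nextSkill(c: Candidate): "brainstorming" | "autopilot" {
  // No spec routes to brainstorming; an existing spec routes to autopilot.
  return c.hasSpec ? "autopilot" : "brainstorming";
}
```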

package/dist/agents/skills/claude-code/harness-secrets/SKILL.md CHANGED

@@ -278,6 +278,16 @@ Phase 4: VALIDATE
   Result: FAIL -- rotation required before deployment, history rewrite recommended
 ```
+## Rationalizations to Reject
+| Rationalization                                             | Reality                                                                                                                                                                                                                                              |
+| ----------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "That key is read-only so it's not a big deal if it leaks"  | Read-only credentials still enable data exfiltration, reconnaissance, and discovery of other vulnerabilities. A leaked read-only database credential exposes every row in the database. Scope does not eliminate risk.                               |
+| "We removed it from the file — it's cleaned up now"         | Removing a secret from the current tree does not remove it from git history. Anyone with a clone of the repository can recover the secret with `git log -p`. Rotation is required regardless of file deletion.                                       |
+| "That's a test environment key, not production"             | Test environment credentials are frequently reused, shared informally, and rotated less often. Leaked test keys also reveal credential patterns and naming conventions that help attackers guess production secrets.                                 |
+| "It's in a private repo so only our team can see it"        | Private repos are accessed by CI/CD systems, third-party integrations, contractors, and former employees. Repository access controls are not a substitute for secret externalization. Breaches routinely originate from compromised internal access. |
+| "We'll move it to an environment variable before we deploy" | Intent does not prevent exposure. The secret is in the codebase now and may already be in commit history, CI logs, or developer machine caches. Remediation must happen at the moment of detection, not at deployment time.                          |
 ## Gates
 - **No CRITICAL findings may remain unaddressed.** Production credentials exposed in source code are blocking. Execution halts until the credential is rotated and the code is remediated.

package/dist/agents/skills/claude-code/harness-security-review/SKILL.md CHANGED

@@ -174,6 +174,16 @@ Threat Model:
 - **`query_graph` / `get_relationships`** — Used in threat modeling phase for data flow tracing
 - **`get_impact`** — Understand blast radius of security-sensitive changes
+## Rationalizations to Reject
+| Rationalization                                                                        | Reality                                                                                                                                                                                                                                                                                                  |
+| -------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The scanner didn't flag it so it must be fine"                                        | Mechanical scanners catch pattern-level issues. They cannot trace user input across multiple function calls to a dangerous sink, detect authorization logic flaws, or evaluate whether a fallback chain fails open. The AI review phase exists precisely because scanners miss semantic vulnerabilities. |
+| "This endpoint is behind authentication so we don't need to validate input"            | Authentication and input validation are orthogonal controls. Authenticated users can still send malicious payloads. Authenticated SQL injection, SSRF, and path traversal are well-documented attack patterns against internal-only endpoints.                                                           |
+| "The vulnerability requires knowing our internal schema to exploit"                    | Security through obscurity is not a control. Internal schema details leak through error messages, API responses, documentation, and employee turnover. Rate the vulnerability based on its impact assuming the attacker knows the system.                                                                |
+| "We'll add rate limiting and input validation later once the feature ships"            | Security controls added after deployment require re-testing and re-review. Shipping without them creates an exposure window and establishes technical debt that is systematically deprioritized once the feature is live.                                                                                |
+| "That's an OWASP theoretical risk — our app isn't targeted by sophisticated attackers" | OWASP findings are exploited by automated scanners, not just sophisticated attackers. Opportunistic bots continuously probe for SQL injection, XSS, and auth bypass. Unpatched OWASP Top 10 issues are routinely exploited within hours of exposure.                                                     |
 ## Gates
 - **Mechanical scanner must run before AI review.** The scanner catches what patterns can catch; AI reviews what remains.

package/dist/agents/skills/claude-code/harness-security-scan/SKILL.md CHANGED

@@ -94,21 +94,11 @@ These apply to ALL skills. If you catch yourself doing any of these, STOP.
 ## Rationalizations to Reject
-### Universal
-These reasoning patterns sound plausible but lead to bad outcomes. Reject them.
-- **"It's probably fine"** — "Probably" is not evidence. Verify before asserting.
-- **"This is best practice"** — Best practice in what context? Cite the source and
-  confirm it applies to this codebase.
-- **"We can fix it later"** — If it is worth flagging, it is worth documenting now
-  with a concrete follow-up plan.
-### Domain-Specific
-- **"No attacker would find this"** — Security by obscurity. If the code is wrong, flag it regardless of discoverability.
-- **"We're behind a firewall"** — Network boundaries change. Code should be secure at every layer regardless of deployment topology.
-- **"The framework handles this for us"** — Verify the framework's actual behavior. Misuse of a secure framework is still insecure.
+| Rationalization                     | Reality                                                                                            |
+| ----------------------------------- | -------------------------------------------------------------------------------------------------- |
+| "No attacker would find this"       | Security by obscurity. If the code is wrong, flag it regardless of discoverability.                |
+| "We're behind a firewall"           | Network boundaries change. Code should be secure at every layer regardless of deployment topology. |
+| "The framework handles this for us" | Verify the framework's actual behavior. Misuse of a secure framework is still insecure.            |
 ## Escalation

package/dist/agents/skills/claude-code/harness-skill-authoring/SKILL.md CHANGED

@@ -121,11 +121,29 @@ depends_on:
 8. **For rigid skills, write `## Escalation`.** Define when to stop and ask for help. Each escalation condition should describe the symptom, the likely cause, and what to report.
+9. **Write `## Rationalizations to Reject`.** Every user-facing skill must include this section. It contains domain-specific rationalizations that prevent agents from skipping steps with plausible-sounding excuses. Format requirements:
+   - **Table format:** `| Rationalization | Reality |` with a header separator row
+   - **3-8 entries** per skill, each specific to the skill's domain
+   - **No generic filler.** Every entry must address a rationalization that is plausible in the context of this specific skill
+   - **Do not repeat universal rationalizations.** The following three are always in effect for all skills and must NOT appear in individual skill tables:
+   | Rationalization         | Reality                                                                     |
+   | ----------------------- | --------------------------------------------------------------------------- |
+   | "It's probably fine"    | "Probably" is not evidence. Verify before asserting.                        |
+   | "This is best practice" | Best practice in what context? Cite the source and confirm it applies here. |
+   | "We can fix it later"   | If worth flagging, document now with a concrete follow-up plan.             |
+   Example of a good domain-specific entry (for a code review skill):
+   | Rationalization                               | Reality                                                                                                                                    |
+   | --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+   | "The tests pass so the logic must be correct" | Passing tests prove the tested paths work. They say nothing about untested paths, edge cases, or whether the tests themselves are correct. |
 ### Phase 5: VALIDATE — Verify the Skill
 1. **Run `harness skill validate`** to check:
    - `skill.yaml` has all required fields and valid values
-   - `SKILL.md` has all required sections (`## When to Use`, `## Process`, `## Harness Integration`, `## Success Criteria`, `## Examples`)
+   - `SKILL.md` has all required sections (`## When to Use`, `## Process`, `## Harness Integration`, `## Success Criteria`, `## Examples`, `## Rationalizations to Reject`)
    - Rigid skills have `## Gates` and `## Escalation` sections
    - The `name` in `skill.yaml` matches the directory name
    - Referenced tools exist
@@ -174,6 +192,16 @@ Use this checklist as a final quality gate before declaring a skill complete.
 - Rigid skills include Gates and Escalation sections with specific conditions and consequences
 - The skill can be loaded and run with `harness skill run <name>`
+## Rationalizations to Reject
+| Rationalization                                                         | Reality                                                                                                                                  |
+| ----------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
+| "This skill is too simple to need all required sections"                | Every section exists for a reason. A short section is fine; a missing section means the skill was not fully thought through.             |
+| "The process section covers it — no need for explicit success criteria" | Process describes what to do. Success criteria describe how to know it worked. They serve different purposes.                            |
+| "Rationalizations to Reject is meta — this skill does not need it"      | This section is required for all user-facing skills, including this one. No exceptions.                                                  |
+| "I will add examples later once the skill is proven"                    | Examples are a required section. A skill without examples forces the agent to guess at correct behavior. Write at least one example now. |
+| "The When to Use section is obvious from the name"                      | Negative conditions (when NOT to use) prevent misapplication. The skill name conveys nothing about boundary conditions.                  |
 ## Examples
 ### Example: Creating a Flexible Skill for Database Migration Review
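
The format requirements added in step 9 of this file (a `| Rationalization | Reality |` table with a header separator row, 3-8 entries, and none of the three universal rationalizations) are concrete enough to check mechanically. The following is a rough sketch of such a check. The `checkRationalizations` function and its messages are hypothetical, and this is not the implementation behind `harness skill validate`, which the diff does not show.

```typescript
// Hypothetical checker. The function name and messages are assumptions; the
// real `harness skill validate` implementation is not shown in this diff.

const UNIVERSAL_RATIONALIZATIONS = [
  "it's probably fine",
  "this is best practice",
  "we can fix it later",
];

function parseRow(line: string): string[] {
  // "| a | b |" -> ["a", "b"]; cells containing a literal "|" are not supported.
  return line.replace(/^\|/, "").replace(/\|$/, "").split("|").map((cell) => cell.trim());
}

function checkRationalizations(skillMd: string): string[] {
  const lines = skillMd.split(/\r?\n/).map((l) => l.trim());
  const start = lines.indexOf("## Rationalizations to Reject");
  if (start === -1) return ["missing `## Rationalizations to Reject` section"];

  const rows: string[][] = [];
  for (const line of lines.slice(start + 1)) {
    if (/^#{1,2}\s/.test(line)) break; // the next top-level section ends the table
    if (line.startsWith("|")) rows.push(parseRow(line));
  }

  const problems: string[] = [];
  const [header, separator, ...entries] = rows;
  if (!header || header.join("|") !== "Rationalization|Reality") {
    problems.push("header row must be `| Rationalization | Reality |`");
  }
  if (!separator || !separator.every((cell) => /^:?-+:?$/.test(cell))) {
    problems.push("missing header separator row");
  }
  if (entries.length < 3 || entries.length > 8) {
    problems.push(`expected 3-8 entries, found ${entries.length}`);
  }
  for (const entry of entries) {
    // Strip surrounding quotes before comparing against the universal list.
    const text = (entry[0] ?? "").replace(/^["“”]+|["“”]+$/g, "").toLowerCase();
    if (UNIVERSAL_RATIONALIZATIONS.includes(text)) {
      problems.push(`universal rationalization repeated: ${entry[0]}`);
    }
  }
  return problems;
}
```

The function returns a list of problems rather than a boolean so that a caller can report every violation at once; an empty list means the section meets the stated format rules. The "no generic filler" requirement is a judgment call and is deliberately left out of the sketch.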

package/dist/agents/skills/claude-code/harness-soundness-review/SKILL.md CHANGED

@@ -1147,6 +1147,16 @@ These criteria validate the skill implementation artifacts. The behavioral succe
 7. `harness validate` passes after all files are written
 8. The skill test suite passes (structure, schema, platform-parity, references)
+## Rationalizations to Reject
+| Rationalization                                                                                          | Reality                                                                                                                                                                        |
+| -------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "The spec looks coherent to me, so I can skip running the S1 internal coherence check"                   | Every check in the mode must run. S1 detects contradictions between decisions, technical design, and success criteria that human review frequently misses.                     |
+| "This unstated assumption is obvious, so documenting it would be pedantic"                               | S3 exists because "obvious" assumptions cause the most damage when they turn out to be wrong. Obvious assumptions are the cheapest to document and the most expensive to miss. |
+| "The success criterion is somewhat vague but the team will know what it means"                           | S7 flags vague criteria like "should be fast" because they are untestable. Vague criteria survive brainstorming and planning only to fail at verification.                     |
+| "This auto-fixable finding is minor, so I will just note it rather than applying the fix"                | Auto-fixable findings should be applied silently -- that is the design intent. Skipping them means the spec ships with known inferrable gaps.                                  |
+| "The feasibility check found a signature mismatch but the code can probably be adapted during execution" | S5 feasibility red flags are always severity "error" and always surface to the user. A spec that references nonexistent modules will produce a broken plan.                    |
 ## Examples
 ### Example: Spec Mode Invocation

package/dist/agents/skills/claude-code/harness-sql-review/SKILL.md CHANGED

@@ -300,6 +300,16 @@ Phase 4: VALIDATE
   Report: NEEDS_ATTENTION (1 N+1 error, 1 missing index, 1 query rewrite)
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                   | Reality                                                                                                                                                                                                                                                                                    |
+| ------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "The ORM handles query optimization automatically"                                                | ORMs generate syntactically correct queries but do not detect N+1 patterns, choose optimal join strategies, or add missing indexes. The ORM executes what the code asks for. A `findMany` followed by per-item `findUnique` calls in a loop is an N+1 regardless of which ORM executes it. |
+| "That endpoint is only called by admins so performance doesn't matter"                            | Admin endpoints frequently become user-facing as products grow. An N+1 query on a 10-row table becomes a crisis when the table grows to 100,000 rows. Query correctness should not be conditional on current data volume.                                                                  |
+| "We can add indexes later if performance becomes a problem"                                       | Adding indexes to large production tables requires exclusive locks or online rebuild procedures that carry risk. Identifying and adding the correct index during development, before the table grows, costs minutes instead of hours of planned maintenance.                               |
+| "That DELETE without a WHERE clause is wrapped in application logic that only calls it correctly" | Application logic has bugs. A missing WHERE clause is a single misrouted request away from deleting the entire table. Database safety constraints must not depend on application-layer correctness.                                                                                        |
+| "The query is fast in development — the test database only has 100 rows"                          | Development databases do not represent production query plans. Full table scans, missing indexes, and N+1 patterns only manifest at production data volumes. Static analysis catches these issues regardless of local data size.                                                           |
 ## Gates
 - **No approving N+1 queries in user-facing hot paths.** An N+1 query in an endpoint called per page load is always an error. It must be fixed with eager loading, batching, or a JOIN before the PR can merge.
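
The first row of the SQL review table describes the N+1 shape as a `findMany` followed by per-item `findUnique` calls in a loop. The sketch below makes the query count visible with a small in-memory stand-in. `FakeDb` and its method names are hypothetical and are not any real ORM's API; they only mirror the `findMany` / `findUnique` naming used in the table.

```typescript
// Hypothetical in-memory stand-in for an ORM client. It counts "queries" so
// the N+1 pattern and the batched alternative can be compared directly.

interface Author { id: number; name: string; }
interface Post { id: number; authorId: number; title: string; }

class FakeDb {
  queryCount = 0;
  private readonly authors: Author[];
  private readonly posts: Post[];

  constructor(authors: Author[], posts: Post[]) {
    this.authors = authors;
    this.posts = posts;
  }

  findManyPosts(): Post[] {
    this.queryCount += 1;
    return this.posts.slice();
  }

  findUniqueAuthor(id: number): Author | undefined {
    this.queryCount += 1;
    return this.authors.find((a) => a.id === id);
  }

  findManyAuthors(ids: number[]): Author[] {
    this.queryCount += 1;
    const wanted = new Set(ids);
    return this.authors.filter((a) => wanted.has(a.id));
  }
}

// N+1: one query for the posts, then one more query per post.
function listPostsNPlusOne(db: FakeDb): string[] {
  return db.findManyPosts().map(
    (p) => `${p.title} by ${db.findUniqueAuthor(p.authorId)?.name ?? "unknown"}`,
  );
}

// Batched: one query for the posts, one query for all referenced authors.
function listPostsBatched(db: FakeDb): string[] {
  const posts = db.findManyPosts();
  const authorIds = Array.from(new Set(posts.map((p) => p.authorId)));
  const nameById = new Map(db.findManyAuthors(authorIds).map((a) => [a.id, a.name] as const));
  return posts.map((p) => `${p.title} by ${nameById.get(p.authorId) ?? "unknown"}`);
}
```

Both functions return the same strings. The difference is that the first issues one query per post on top of the initial query, while the second always issues two, which is why row count in a development database hides the problem.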

package/dist/agents/skills/claude-code/harness-state-management/SKILL.md CHANGED

@@ -218,6 +218,16 @@ Treat learnings as a first-class project artifact. They are as valuable as tests
 - State is saved before session end with an accurate session summary
 - State files are committed to git separately from code changes
+## Rationalizations to Reject
+| Rationalization                                                                                                       | Reality                                                                                                                                                                                                                                                                                                   |
+| --------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The session is short — I'll update state at the end rather than after each task."                                    | Context resets happen without warning. A session that ends mid-task with no state update forces the next session to reconstruct position by reading git history and code, which takes longer and produces an inaccurate picture. State is updated after each task, not at the end of the session.         |
+| "This decision is obvious from the code — I don't need to record the rationale in state."                             | What is obvious to the agent that made the decision is opaque to the agent that resumes three weeks later with no memory of the session. Decisions are recorded because the context that made them obvious does not survive a context reset. The rationale is exactly what needs to be saved.             |
+| "The learnings file is getting long — I'll trim old entries that are no longer relevant."                             | Learnings are append-only by design. An entry that seems irrelevant may become relevant when a related pattern resurfaces. Trimming destroys the chronological record and the ability to understand why earlier decisions were made. Entries are never deleted, only supplemented with corrections.       |
+| "I can re-read the plan to figure out where I am — I don't need to update the position in state."                     | The plan describes what to do; state records what has been done. Re-reading the plan without state requires the next session to infer progress from code, which produces uncertain position. Uncertain position leads to re-executing completed tasks or skipping tasks that appear complete but are not. |
+| "The stream auto-resolves from the branch — I don't need to explicitly verify which stream is active before writing." | Auto-resolution works when branch names match stream names and the index is current. When branches are renamed, stale, or when multiple streams exist for the same feature, auto-resolution can write to the wrong stream silently. Always announce the resolved stream before writing state.             |
 ## Examples
 ### Example: Starting a New Session (Resuming Work)

package/dist/agents/skills/claude-code/harness-supply-chain-audit/SKILL.md CHANGED

@@ -222,6 +222,16 @@ Do not assert risk scores without citing the specific data point that generated
 - **If the npm or GitHub API is unavailable:** Note which factors were skipped with "unknown" scores. Do not fail the audit — partial results are better than none.
 - **If the user asks for a verdict ("is this safe?"):** Decline to give a binary answer. Supply chain risk is probabilistic. Present the risk signals and let the human decide.
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                             | Why It Is Wrong                                                                                                                                        |
+| ------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "This package has high risk signals but it is widely used, so it must be safe"              | The Iron Law: present findings as flags for human review, never as verdicts. Popularity does not eliminate bus-factor risk or maintenance abandonment. |
+| "The npm API returned an error for this package, so I will skip it and move on"             | API failures produce "unknown" scores with a note, not skips. Partial results with noted gaps are always better than incomplete audits.                |
+| "The install script is probably just native addon compilation, so I do not need to flag it" | Every install script must be flagged in the report. "Probably legitimate" is exactly the assumption that supply chain attacks exploit.                 |
 ## Examples
 ```

package/dist/agents/skills/claude-code/harness-tdd/SKILL.md CHANGED

@@ -114,6 +114,15 @@ Repeat the 4 phases for each new behavior. A typical feature requires 3-10 cycle
 - No test tests implementation details (only observable behavior)
 - No production code exists that was not demanded by a failing test
+## Rationalizations to Reject
+| Rationalization                                                                                     | Reality                                                                                                                                                                                             |
+| --------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "I know exactly what the implementation should be, so I will write it first and add the test after" | Code before test equals delete it. The gate is explicit: if production code is written before a failing test exists, delete the production code and start correctly.                                |
+| "The test passed on the first run, so TDD is working"                                               | If the test passed without implementing the production code, either the behavior already exists or the test is wrong. You must watch the test FAIL for the right reason before proceeding to GREEN. |
+| "I will test multiple behaviors in this one test to be efficient"                                   | One test, one assertion, one behavior. Multi-behavior tests make it impossible to pinpoint which behavior broke when the test fails.                                                                |
+| "Harness validate can wait until the end of the feature since it slows down the cycle"              | No skipping VALIDATE. Every cycle must end with harness check-deps and harness validate. A passing test with a failing validation means the implementation violated a project constraint.           |
 ## Examples
 ### Example: Adding a `calculateTotal` function

package/dist/agents/skills/claude-code/harness-test-advisor/SKILL.md CHANGED

@@ -126,6 +126,16 @@ npx vitest run tests/services/auth.test.ts tests/types/user.test.ts tests/routes
 - Report follows the structured output format
 - All findings are backed by graph query evidence (with graph) or systematic static analysis (without graph)
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                                    | Why It Is Wrong                                                                                                                                                                     |
+| -------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "Only the Tier 1 direct tests matter -- Tier 2 and Tier 3 are probably unnecessary"                | Tier 2 tests catch indirect breakage one hop away. A change to auth.ts breaks login.ts which breaks login.test.ts. Skipping Tier 2 misses exactly the regressions hardest to debug. |
+| "The changed file has no tests, but that is not my concern -- I just advise on which tests to run" | Coverage gaps must be flagged. When a changed file has no test coverage, the advisor reports it. Silently producing an empty test list gives false confidence.                      |
+| "The graph is stale but I will use it anyway since some data is better than no data"               | If the graph is more than 10 commits behind, refresh before proceeding. Staleness sensitivity is Medium for test advisor.                                                           |
 ## Examples
 ### Example: Selecting Tests for a Services Change