npm - @harness-engineering/cli - Versions diffs - 1.23.0 → 1.23.2 - Mend

@harness-engineering/cli 1.23.0 → 1.23.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (423)

package/dist/agents/skills/gemini-cli/harness-integration-test/SKILL.md CHANGED

@@ -256,6 +256,15 @@ describe('ProjectService contract', () => {
 });
 ```
+## Rationalizations to Reject
+| Rationalization                                                                           | Why It Is Wrong                                                                                                                                             |
+| ----------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "Testing the happy path is sufficient -- error scenarios are edge cases"                  | The success criteria require error scenarios (400, 401, 403, 404, 500, timeout) for all public endpoints. Error paths are where real-world failures happen. |
+| "We can test against the staging environment instead of setting up local mocks"           | No integration tests that require external staging environments for CI. Tests must run with local test doubles.                                             |
+| "The consumer contract changed, so I will update the consumer test to match the provider" | Contract changes must be coordinated. The provider may have introduced a bug, not an intentional change.                                                    |
+| "Tests pass when I run them in order, so they are fine"                                   | Phase 4 requires running tests in random order. Any test that fails only in a specific order has a shared-state bug.                                        |
 ## Gates
 - **No integration tests that require external staging environments for CI.** Every integration test must run with local test doubles (mocks, containers, in-memory databases). Tests that fail without a staging VPN are not integration tests -- they are environment tests.
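
Note on the random-order row added above: a test that fails only in a specific order points to shared state. The following is a minimal sketch of a seeded shuffle, assuming the test cases are available as a plain array. The names `mulberry32` and `shuffleTests` are hypothetical helpers for illustration and are not part of the harness CLI. A fixed seed lets a failing order be replayed.

```typescript
// Hypothetical helper: small deterministic PRNG (mulberry32 variant).
// A fixed seed makes a failing test order reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle; returns a new array and leaves the input untouched.
function shuffleTests<T>(tests: readonly T[], seed: number): T[] {
  const rand = mulberry32(seed);
  const out = [...tests];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Usage: log the seed alongside the run so the same order can be replayed.
const order = shuffleTests(["creates user", "lists users", "deletes user"], 42);
console.log(order);
```

If a test passes in declaration order but fails under some seed, rerunning with that seed reproduces the order while the shared-state bug is tracked down.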

package/dist/agents/skills/gemini-cli/harness-integrity/SKILL.md CHANGED

@@ -122,6 +122,16 @@ Rules:
 - [ ] Unified report follows the exact format
 - [ ] Overall verdict correctly reflects both mechanical and review results
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                                                | Why It Is Wrong                                                                                                                                    |
+| -------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "All three mechanical checks failed, but I should still run the AI review to get useful feedback"              | When ALL three checks fail, stop immediately. Do not proceed to Phase 2. AI review on code that does not compile is wasted effort.                 |
+| "The security scanner found a warning but it is not high severity, so it should not affect the overall result" | Error-severity security findings are blocking. The distinction is severity, not the agent's opinion of importance.                                 |
+| "The AI review flagged an architectural concern as blocking, so the integrity check should fail"               | Only runtime errors, data loss, and security vulnerabilities count as blocking review findings. Architectural concerns are noted but do not block. |
 ## Examples
 ### Example: All Clear

package/dist/agents/skills/gemini-cli/harness-knowledge-mapper/SKILL.md CHANGED

@@ -162,6 +162,15 @@ This ensures subsequent graph queries (impact analysis, drift detection) include
 - Report follows the structured output format
 - All findings are backed by graph query evidence (with graph) or directory/file analysis (without graph)
+## Rationalizations to Reject
+| Rationalization                                                                         | Why It Is Wrong                                                                                                                                                   |
+| --------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The graph is a few commits behind, but it is close enough for knowledge mapping"       | If the graph is more than 10 commits behind, run harness scan before proceeding. A stale graph produces a knowledge map with missing modules.                     |
+| "No graph exists, so this skill cannot produce useful output"                           | The fallback strategy is explicit: use directory structure and file analysis. Fallback completeness is ~50%, significantly better than nothing.                   |
+| "The existing AGENTS.md is outdated, so I will overwrite it with the generated version" | Never overwrite without confirmation. Existing AGENTS.md may contain carefully authored context the graph cannot infer.                                           |
+| "The module descriptions I inferred from function names are accurate enough"            | Inferred descriptions are starting points. Phase 3 (AUDIT) exists to identify coverage gaps. Name-based inference misses purpose, constraints, and relationships. |
 ## Examples
 ### Example: Generating AGENTS.md from Graph

package/dist/agents/skills/gemini-cli/harness-load-testing/SKILL.md CHANGED

@@ -259,6 +259,16 @@ Phase 4: ANALYZE
   Recommendation: Add DataLoader for orders resolver, re-test after fix
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                             | Reality                                                                                                                                                                                                                                                                  |
+| ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "The smoke test passed, so the full load test will probably be fine too."                                   | A smoke test at 1-2 VUs tells you the script runs — it says nothing about behavior at 100 or 1000 VUs. Connection pool exhaustion, lock contention, and GC pressure only appear under load. Smoke passing is the floor, not the ceiling.                                 |
+| "Staging is smaller than production, so results won't be accurate anyway — no point running the full test." | Staging results are always useful as a proxy: they reveal algorithmic bottlenecks, N+1 queries, and missing indexes that scale identically regardless of instance count. Document the scale factor and use it. Do not skip testing because the environment is imperfect. |
+| "We haven't changed the API, so the old load test baselines still apply."                                   | Baselines go stale when dependencies update, traffic patterns shift, or adjacent services change. A deployment that adds one middleware layer or changes a database index can move p99 by 200ms. Baselines must be re-validated, not assumed.                            |
+| "The p95 threshold is arbitrary — let's just relax it until the test passes."                               | A threshold without a documented basis is a guess. A threshold lowered to make a failing test pass is a suppressed regression. Thresholds must be derived from SLOs or measured baselines. If the SLO is wrong, change the SLO explicitly with stakeholder sign-off.     |
+| "We'll run the soak test later — we just need to ship the load test first."                                 | Soak tests catch failures that only emerge over hours: memory leaks, connection pool exhaustion, log file growth. If the feature involves a long-lived process, background worker, or WebSocket, skipping the soak test means the failure surfaces in production.        |
 ## Gates
 - **No load tests against production without explicit human approval.** Load tests can cause real outages. The target environment must be verified as non-production before execution. If production testing is required, a `[checkpoint:human-verify]` must be passed with documented approval.

package/dist/agents/skills/gemini-cli/harness-ml-ops/SKILL.md CHANGED

@@ -326,6 +326,16 @@ Phase 4: VALIDATE
   After fixes: projected NEEDS_ATTENTION (missing precision/recall metrics)
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                         | Reality                                                                                                                                                                                                                                                                                                                                 |
+| ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We re-trained with more data but the architecture is the same — the previous evaluation still applies."                | Evaluation results are bound to a specific model artifact, not to the architecture. A re-trained model with different weights can have dramatically different failure modes even if accuracy appears similar. Every model version that goes to production must be evaluated against the golden set, not inherited from its predecessor. |
+| "The model file is only 8MB — committing it to git is more convenient than setting up an artifact store."               | Model files in git corrupt repository history, explode clone times for all contributors, and cannot be versioned alongside experiment metadata. Convenience now creates permanent technical debt. The artifact store setup is a one-time cost; git pollution is permanent.                                                              |
+| "Loading the model inside the request handler is simpler — the model is small enough that latency won't be noticeable." | Per-request model loading adds I/O and deserialization on every inference call, holds no persistent state across requests, and collapses under any meaningful concurrency. "Small enough" is a guess without measurement. Models must be loaded at startup and held in memory.                                                          |
+| "We can add experiment tracking after we get the model working — right now we just need to iterate quickly."            | Experiment tracking is hardest to add retroactively because you cannot reconstruct the conditions of runs you did not log. The runs being executed without tracking right now are the ones producing the model that may go to production. Log them now or accept that the model is not reproducible.                                    |
+| "The prompt template is short enough to read in context — version controlling it adds unnecessary process."             | Prompts embedded in application code change silently when developers edit them, have no history of what changed and why, and cannot be evaluated independently. A prompt is a model artifact. It requires the same versioning, evaluation, and promotion discipline as model weights.                                                   |
 ## Gates
 - **No deploying models without evaluation.** A model that has not been evaluated against a golden set or baseline cannot be promoted to production. This is always an error.

package/dist/agents/skills/gemini-cli/harness-mobile-patterns/SKILL.md CHANGED

@@ -311,6 +311,16 @@ Phase 4: VALIDATE
   Store submission ready: PASS
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                                  | Reality                                                                                                                                                                                                                                                                                                  |
+| -------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We request all permissions at launch to get them out of the way — users can deny them if they want."                            | App stores treat permissions-at-launch as a review red flag and users deny at much higher rates when there is no contextual explanation. Permissions requested at the moment they are needed, with a sentence explaining why, consistently achieve higher grant rates and reduce store rejection risk.   |
+| "Universal Links are optional — the URL scheme fallback works fine for deep linking."                                            | URL scheme fallbacks (`myapp://`) can be claimed by any installed app on the device. A malicious or coincidentally named app can intercept links intended for yours. Universal Links with verified `apple-app-site-association` files are cryptographically bound to your domain and cannot be hijacked. |
+| "The push notification handler works in foreground and background — we can handle the terminated state separately after launch." | Users often first interact with an app by tapping a push notification when the app is terminated. The cold-start tap handler is commonly the first impression. Shipping without it means a class of users experiences a broken entry point from day one.                                                 |
+| "The staging configuration is slightly different but we'll remember to change it before the App Store build."                    | "Remember to change it" is not a process. Staging URLs, debug API keys, and sandbox APNs environments in production builds have shipped before and will again. Separate build configurations and environment-specific entitlement files are the only reliable mitigation.                                |
+| "The privacy manifest requirement is new — we'll add it in the next release after the store flags it."                           | Apple has enforced PrivacyInfo.xcprivacy requirements for new submissions and updates since May 2024. Submitting without it results in rejection, which blocks the entire release. Adding it retroactively under rejection pressure is strictly more costly than adding it now.                          |
 ## Gates
 - **No missing permission usage descriptions.** Every permission requested in code must have a corresponding usage description in the platform manifest. Missing descriptions cause automatic App Store rejection on iOS and are a best practice requirement on Android.

package/dist/agents/skills/gemini-cli/harness-mutation-test/SKILL.md CHANGED

@@ -236,6 +236,15 @@ mvn org.pitest:pitest-maven:mutationCoverage
 # Report generated at target/pit-reports/index.html
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                | Why It Is Wrong                                                                                                      |
+| ---------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
+| "We have 80% line coverage, so test quality is already good"                                   | Line coverage measures execution, not verification. Mutation testing reveals missing assertions and weak assertions. |
+| "The survived mutants are in non-critical utility code, so we can ignore them"                 | Every survived mutant must be either addressed with a test or explicitly justified as an equivalent mutant.          |
+| "I will write a test that targets the specific mutation to kill it"                            | No gaming the mutation score. Every new test must test a meaningful behavior, not just kill a specific mutant.       |
+| "The test suite has some failures, but we can still run mutation testing to see what we learn" | No mutation testing against a failing test suite. Mutations against broken tests produce garbage results.            |
 ## Gates
 - **No mutation testing against a failing test suite.** All tests must pass before mutants are generated. Running mutations against broken tests produces garbage results. Fix the tests first.

package/dist/agents/skills/gemini-cli/harness-observability/SKILL.md CHANGED

@@ -268,6 +268,16 @@ Phase 4: VALIDATE
   Result: WARN -- 3 instrumentation gaps, alerting needs SLO alignment
 ```
+## Rationalizations to Reject
+| Rationalization                                                                        | Reality                                                                                                                                                                                                                                                                                                     |
+| -------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We can see what's happening in CloudWatch logs — we don't need structured logging"    | Unstructured log lines cannot be queried, aggregated, or correlated across services. When an incident spans three services, searching for a request ID across unstructured logs is manual forensics. Structured logging is not a nicety — it is the foundation for incident response.                       |
+| "We'll add alerting once we've seen a few incidents and know what to alert on"         | The first incident is the worst time to define alerting. SLO-based burn rate alerts can be defined from traffic patterns before any incidents occur. Waiting for incidents to define thresholds means every early failure goes undetected.                                                                  |
+| "User ID is a useful label for the latency metric — it helps us debug per-user issues" | User ID as a metric label creates one time series per user, which at 100,000 users means 100,000 label combinations. High-cardinality labels exhaust metric storage, cause query timeouts, and make the entire metrics system unstable. Use logs for per-user debugging; use metrics for aggregate signals. |
+| "The tracing library is initialized, so we have distributed tracing"                   | Initializing the library creates root spans but does not propagate context across HTTP boundaries, instrument database calls, or connect traces to logs. Trace initialization without verified end-to-end propagation produces disconnected, useless traces.                                                |
+| "We have alerts — they're just not linked to runbooks yet"                             | An alert that fires at 3am without a runbook link requires the on-call engineer to start debugging from scratch. The absence of a runbook is not a documentation gap; it is a mean-time-to-recover multiplier.                                                                                              |
 ## Gates
 - **No sensitive data in logs.** If PII, credentials, or tokens are detected in log output, it is a blocking finding. The logging configuration must sanitize or redact sensitive fields before any other improvements are made.
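
Note on the user-ID label row added above: the cardinality argument can be checked with simple arithmetic. The sketch below multiplies per-label cardinalities to get an upper bound on the number of time series for one metric. The function name `estimateSeriesCount` and the label counts are hypothetical, chosen only for illustration. Real systems create a series only for combinations that actually occur, so this is a worst case.

```typescript
// Upper bound on time series for one metric: the product of the number of
// distinct values each label can take. (Hypothetical helper and numbers.)
function estimateSeriesCount(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Aggregate labels only: 20 * 4 * 5 = 400 series.
const aggregate = estimateSeriesCount({ route: 20, method: 4, status: 5 });

// Adding a user_id label with 100,000 users multiplies the bound by 100,000.
const perUser = estimateSeriesCount({ route: 20, method: 4, status: 5, user_id: 100_000 });

console.log(aggregate); // 400
console.log(perUser);   // 40000000
```

The multiplication is why a single high-cardinality label dominates the total, whatever the other labels are.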

package/dist/agents/skills/gemini-cli/harness-onboarding/SKILL.md CHANGED

@@ -23,7 +23,7 @@
    - Constraints and forbidden patterns
    - Any special instructions or warnings
-2. **Read `harness.yaml`.** Extract:
+2. **Read `harness.config.json`.** Extract:
    - Project name and stack
    - Adoption level (basic, intermediate, advanced)
    - Layer definitions and their directory mappings
@@ -48,7 +48,7 @@
 2. **Map the architecture.** Walk the directory structure and identify:
    - Top-level organization pattern (monorepo, single package, workspace)
    - Source code location and entry points
-   - Layer boundaries (from `harness.yaml` and actual directory structure)
+   - Layer boundaries (from `harness.config.json` and actual directory structure)
    - Shared utilities or common modules
    - Configuration files and their purposes
@@ -61,7 +61,7 @@
    - Code formatting (detect from config files: `.prettierrc`, `.eslintrc`, `biome.json`)
 4. **Map the constraints.** Identify what is restricted:
-   - Forbidden imports (from `harness.yaml` dependency constraints)
+   - Forbidden imports (from `harness.config.json` dependency constraints)
    - Layer boundary rules (which layers can import from which)
    - Linting rules that encode architectural decisions
    - Any constraints documented in `AGENTS.md` that are not yet automated
@@ -95,8 +95,8 @@ Graph queries produce a complete architecture map in seconds, including transiti
 ### Phase 3: ORIENT — Identify Adoption Level and Maturity
-1. **Confirm the adoption level** matches what `harness.yaml` declares:
-   - Basic: `AGENTS.md` and `harness.yaml` exist but no layers or constraints
+1. **Confirm the adoption level** matches what `harness.config.json` declares:
+   - Basic: `AGENTS.md` and `harness.config.json` exist but no layers or constraints
    - Intermediate: Layers defined, dependency constraints enforced, at least one custom skill
    - Advanced: Personas, state management, learnings, CI integration
@@ -184,21 +184,29 @@ Graph queries produce a complete architecture map in seconds, including transiti
 - **`harness check-deps`** — Run to verify dependency constraints are passing, which confirms layer boundaries are respected.
 - **`harness state show`** — View current state to understand where the last session left off.
 - **`AGENTS.md`** — Primary source of project context and agent instructions.
-- **`harness.yaml`** — Source of structural configuration (layers, constraints, skills).
+- **`harness.config.json`** — Source of structural configuration (layers, constraints, skills).
 - **`.harness/learnings.md`** — Historical context and institutional knowledge.
 ## Success Criteria
-- All four configuration sources were read (`AGENTS.md`, `harness.yaml`, `.harness/learnings.md`, `.harness/state.json`)
+- All four configuration sources were read (`AGENTS.md`, `harness.config.json`, `.harness/learnings.md`, `.harness/state.json`)
 - Technology stack is accurately identified (language, framework, test runner, build tool)
 - Architecture is mapped with correct layer boundaries and dependency directions
 - Conventions are identified from actual code patterns, not assumed
-- Constraints are enumerated from both `harness.yaml` and `AGENTS.md`
+- Constraints are enumerated from both `harness.config.json` and `AGENTS.md`
 - Adoption level is confirmed (not just declared — validated)
 - A structured orientation summary is produced with all sections filled
 - The "Getting Started" section is actionable and tailored to the audience
 - `harness validate` was run and results are reported
+## Rationalizations to Reject
+| Rationalization                                                                                                            | Reality                                                                                                                                                                                          |
+| -------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "I can skip reading .harness/learnings.md since it is just historical notes"                                               | Learnings contain hard-won insights from previous sessions -- decisions made, gotchas discovered, patterns that worked or failed. Skipping them means repeating mistakes already diagnosed.      |
+| "The harness.config.json says intermediate, so I can report that without validation"                                       | Declared adoption level must be confirmed, not assumed. A project that declares intermediate but fails harness validate is not truly intermediate.                                               |
+| "I will map the architecture by reading the directory names since that is faster than checking conventions in actual code" | Conventions must be identified from actual code patterns, not assumed from directory structure. File naming, import style, and error handling can only be verified by reading real source files. |
 ## Examples
 ### Example: Onboarding to an Intermediate TypeScript Project
@@ -211,7 +219,7 @@ Read AGENTS.md:
   - Stack: TypeScript, Express, Vitest, PostgreSQL
   - Conventions: zod validation, repository pattern, kebab-case files
-Read harness.yaml:
+Read harness.config.json:
   - Level: intermediate
   - Layers: presentation (src/routes/), business (src/services/), data (src/repositories/)
   - Constraints: presentation → business OK, business → data OK, data → presentation FORBIDDEN
@@ -258,7 +266,7 @@ Produce orientation with all sections. Getting Started for this context:
 ```
 Read AGENTS.md — exists, minimal content
-Read harness.yaml — level: basic, no layers defined
+Read harness.config.json — level: basic, no layers defined
 No .harness/learnings.md
 No .harness/state.json
 ```
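
Note on the layer constraints in the onboarding example above (presentation → business OK, business → data OK, data → presentation FORBIDDEN): a minimal sketch of how such rules could be checked against file paths. It does not read `harness.config.json`, whose schema is not shown in this diff. The `allowed` table, `layerOf`, and `isImportAllowed` are hypothetical, and treating every unlisted cross-layer import as forbidden is an assumption.

```typescript
type Layer = "presentation" | "business" | "data";

// Directory mappings taken from the example in the diff.
const layerDirs: Record<Layer, string> = {
  presentation: "src/routes/",
  business: "src/services/",
  data: "src/repositories/",
};

// Assumption: an allowlist, so any cross-layer import not listed is forbidden.
const allowed: Record<Layer, Layer[]> = {
  presentation: ["business"],
  business: ["data"],
  data: [],
};

// Map a file path to its layer by directory prefix, if any.
function layerOf(path: string): Layer | undefined {
  return (Object.keys(layerDirs) as Layer[]).find((l) => path.startsWith(layerDirs[l]));
}

function isImportAllowed(fromFile: string, toFile: string): boolean {
  const from = layerOf(fromFile);
  const to = layerOf(toFile);
  // Files outside known layers, or within the same layer, are not restricted here.
  if (!from || !to || from === to) return true;
  return allowed[from].includes(to);
}

console.log(isImportAllowed("src/routes/users.ts", "src/services/user-service.ts"));
console.log(isImportAllowed("src/repositories/user-repo.ts", "src/routes/users.ts"));
```

In the project itself the skill text points to `harness check-deps` for this verification. The sketch only illustrates the rule shape.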

package/dist/agents/skills/gemini-cli/harness-parallel-agents/SKILL.md CHANGED

@@ -159,6 +159,15 @@ For each independent task, write a focused agent brief:
 - `harness validate` passes after integration
 - No agent modified files outside its declared scope
+## Rationalizations to Reject
+| Rationalization                                                                              | Why It Is Wrong                                                                                                                                |
+| -------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
+| "These two tasks touch different functions in the same file, so they are independent enough" | If both tasks write to the same file, they are NOT independent. Even different functions in the same file creates merge conflicts.             |
+| "I verified independence manually -- no need to run check_task_independence"                 | Manual verification misses transitive dependency overlap. check_task_independence with graph-expanded analysis catches transitive conflicts.   |
+| "There are only 2 independent tasks, but parallelism would save time"                        | NOT when there are fewer than 3 independent tasks. Coordination overhead outweighs parallelism benefit for 2 tasks.                            |
+| "Each agent's tests pass, so integration is fine"                                            | Step 4 requires running the FULL test suite after integration. Parallel changes can cause integration failures that individual test runs miss. |
 ## Examples
 ### Example: Parallel Implementation of Three Independent Services
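
Note on the same-file row added above: the direct-overlap part of the rule (two tasks writing the same file are not independent) can be sketched as a pairwise set intersection. This is not the `check_task_independence` tool named in the table. `TaskScope` and `findFileOverlaps` are hypothetical names, and the sketch does not cover the transitive dependency overlap that the table says the graph-expanded analysis catches.

```typescript
// Hypothetical shape: each task declares the files it intends to write.
interface TaskScope {
  name: string;
  files: string[];
}

interface Overlap {
  a: string;
  b: string;
  files: string[];
}

// Report every pair of tasks whose declared write sets share a file.
function findFileOverlaps(tasks: TaskScope[]): Overlap[] {
  const overlaps: Overlap[] = [];
  for (let i = 0; i < tasks.length; i++) {
    const seen = new Set(tasks[i].files);
    for (let j = i + 1; j < tasks.length; j++) {
      const shared = tasks[j].files.filter((f) => seen.has(f));
      if (shared.length > 0) {
        overlaps.push({ a: tasks[i].name, b: tasks[j].name, files: shared });
      }
    }
  }
  return overlaps;
}

const conflicts = findFileOverlaps([
  { name: "add-validation", files: ["src/services/user.ts"] },
  { name: "add-logging", files: ["src/services/user.ts", "src/logger.ts"] },
  { name: "update-docs", files: ["README.md"] },
]);
console.log(conflicts);
```

An empty result rules out only direct file conflicts, so it is a precondition for parallel dispatch and not a substitute for the documented independence check.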

package/dist/agents/skills/gemini-cli/harness-perf/SKILL.md CHANGED

@@ -187,6 +187,17 @@ This phase runs only when `.bench.ts` files exist in the project. If none are fo
 - Gate decision is recorded in state
 - `harness validate` passes after enforcement
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                                               | Why It Is Wrong                                                                                                                                                         |
+| ------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The cyclomatic complexity is 16 but the function is straightforward, so I can override the Tier 1 threshold" | Tier 1 violations are non-negotiable blockers. No merge with Tier 1 performance violations. If a threshold needs adjustment, reconfigure with documented justification. |
+| "The benchmark regression is only 6% and it is probably just noise"                                           | The noise margin (default 3%) is applied before flagging. A 6% regression on a perf-critical path exceeds the Tier 1 threshold even after noise consideration.          |
+| "The working tree has a small uncommitted change but it should not affect benchmark results"                  | No running benchmarks with a dirty working tree. Uncommitted changes invalidate benchmark results.                                                                      |
+| "I will update the baselines to match the new performance numbers rather than fixing the regression"          | Baselines must come from fresh runs against committed code. Silently moving the goalposts defeats the purpose of performance gates.                                     |
 ## Examples
 ### Example: PR with High Complexity Function
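
Note on the noise-margin row added above: a minimal sketch of comparing a benchmark result against a baseline, flagging a regression only when the slowdown exceeds the noise margin. The 3% default comes from the table. How `harness` actually applies the margin is not shown in this diff, so the comparison below is an assumption. `regressionPercent` and `isRegression` are hypothetical names.

```typescript
// Percentage slowdown of the current run relative to the baseline.
// Positive means slower; negative means faster.
function regressionPercent(baselineMs: number, currentMs: number): number {
  return ((currentMs - baselineMs) / baselineMs) * 100;
}

// Assumption: a result counts as a regression only if the slowdown
// is larger than the noise margin (default 3%, per the table above).
function isRegression(baselineMs: number, currentMs: number, noiseMarginPercent = 3): boolean {
  return regressionPercent(baselineMs, currentMs) > noiseMarginPercent;
}

console.log(isRegression(100, 106)); // 6% slower, above the 3% margin
console.log(isRegression(100, 102)); // 2% slower, within the margin
```

Under this reading, a 6% slowdown cannot be dismissed as noise, because the margin has already been taken into account before the result is flagged.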

package/dist/agents/skills/gemini-cli/harness-perf-tdd/SKILL.md CHANGED

@@ -235,6 +235,16 @@ harness check-perf — complexity reduced from 12 to 8 (improvement)
 harness perf baselines update — new baseline saved
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                           | Reality                                                                                                                                                                                                                                                   |
+| --------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The correctness test is green, I'll add the benchmark later when we know performance is an issue."       | The benchmark is not optional — it is the mechanism that defines "performance issue." Without a baseline captured at implementation time, you have nothing to compare against when a regression appears months later. Later never comes.                  |
+| "I'll skip the REFACTOR phase since the spec doesn't mention performance requirements."                   | The spec not mentioning a requirement means there is no user-facing SLO, not that performance is irrelevant. The benchmark still captures the baseline that future work must not regress from. Phase 3 is optional; the benchmark file is not.            |
+| "The benchmark results vary too much between runs to be meaningful — I'll just omit it."                  | Variance is a signal, not a reason to skip. High variance means the benchmark needs warmup iterations, more samples, or isolation from I/O. Fix the benchmark, do not delete it. An absent benchmark offers zero protection against regressions.          |
+| "This function is only called during startup, so its performance doesn't matter at runtime."              | Startup performance determines deployment speed, lambda cold-start latency, and test suite duration. "Not in the hot path at runtime" does not mean performance is free to ignore. Measure it so the baseline exists if startup behavior changes.         |
+| "We already have an integration test that covers this — writing a separate benchmark would be redundant." | Integration tests verify correctness under realistic conditions. Benchmarks measure isolated performance with precise input control. An integration test that passes in 2 seconds tells you nothing about whether the function itself takes 1ms or 800ms. |
 ## Gates
 - **No code before test AND benchmark.** Both must exist before implementation begins.
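
The table's advice on noisy benchmarks (add warmup iterations and more samples rather than deleting the benchmark) might look like the following sketch. It uses only the Python standard library; the `benchmark` helper, its defaults, and the returned keys are assumptions for illustration, not part of the harness tooling.

```python
import statistics
import time


def benchmark(fn, warmup=5, samples=30):
    """Time fn() after warmup runs and report mean, stdev, and variation."""
    if samples < 2:
        raise ValueError("need at least two samples to estimate variance")
    # Warmup runs are discarded so caches and lazy initialization settle.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)
    # Coefficient of variation: a high value suggests the benchmark needs
    # more samples or better isolation, not that it should be removed.
    cv = stdev / mean if mean > 0 else 0.0
    return {"mean_s": mean, "stdev_s": stdev, "cv": cv, "samples": samples}


result = benchmark(lambda: sorted(range(10_000), reverse=True))
print(f"mean={result['mean_s']:.6f}s cv={result['cv']:.2%}")
```

If the reported variation stays high, increasing `warmup` and `samples` or removing I/O from `fn` is the fix the table points to.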

package/dist/agents/skills/gemini-cli/harness-planning/SKILL.md CHANGED

@@ -468,6 +468,16 @@ When `docs/changes/` exists in the project, produce `docs/changes/<feature>/delt
 - When `rigorLevel` is `standard` and task count < 8, the skeleton is skipped
 - The skeleton format is lightweight (~200 tokens): numbered groups with task count and time estimates
+## Rationalizations to Reject
+| Rationalization                                                                                               | Reality                                                                                                                                                                 |
+| ------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The task is conceptually clear so I do not need to include exact code in the plan"                           | Every task must have exact file paths, exact code, and exact commands. If you cannot write the code in the plan, you do not understand the task well enough to plan it. |
+| "This task touches 5 files but it is logically one unit of work, so splitting it would add overhead"          | Tasks touching more than 3 files must be split. The overhead of splitting is far less than the cost of a failed oversized task.                                         |
+| "Tests for this task can be added in a follow-up task since the implementation is straightforward"            | No skipping TDD in tasks. Every code-producing task must start with writing a test. "Add tests later" is explicitly forbidden.                                          |
+| "The spec does not cover this edge case, but I can fill in the gap during planning"                           | When the spec is missing information, do not fill in the gaps yourself. Escalate. Filling gaps silently creates undocumented design decisions that no one reviewed.     |
+| "I discovered we need an additional file during decomposition, but updating the file map is just bookkeeping" | The file map must be complete. Every file that will be created or modified must appear in the file map before task decomposition.                                       |
 ## Examples
 ### Example: Planning a User Notification Feature

package/dist/agents/skills/gemini-cli/harness-pre-commit-review/SKILL.md CHANGED

@@ -284,6 +284,15 @@ fi
 - [ ] AI review focused on high-signal issues only (no style nits)
 - [ ] Report follows the structured format exactly
+## Rationalizations to Reject
+| Rationalization                                                               | Why It Is Wrong                                                                                                                                               |
+| ----------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The lint errors are just warnings, so I can proceed to AI review"            | The gate is absolute: any mechanical check failure means STOP. AI review does not run until lint, typecheck, and tests all pass.                              |
+| "This is a docs-only change but let me run AI review anyway for thoroughness" | The fast path is mandatory. If only docs/config files changed, AI review is skipped. Running it anyway wastes tokens.                                         |
+| "The AI found a style issue, so I should block the commit"                    | AI review observations are advisory only. Only mechanical check failures block the commit.                                                                    |
+| "I will skip the security scan since this is an internal endpoint"            | Phase 3 runs the security scanner against all staged source files regardless of exposure. Hardcoded secrets and injection are blocking even in internal code. |
 ## Examples
 ### Example: Clean Commit
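
The gate ordering described in the rationalizations table above (mechanical checks first, docs-only fast path, advisory AI review) can be sketched as a small decision function. The name `plan_review`, the result dictionary, and the list of suffixes treated as docs or config are all assumptions; the skill's real classification rules are not shown in this diff.

```python
# Assumption: which file suffixes count as "docs/config only".
DOC_CONFIG_SUFFIXES = (".md", ".json", ".yaml", ".yml")


def plan_review(mechanical_results, changed_files):
    """Decide whether to stop, and whether AI review should run.

    mechanical_results maps a check name (lint, typecheck, tests) to True
    when that check passed.
    """
    failed = sorted(name for name, ok in mechanical_results.items() if not ok)
    if failed:
        # Any mechanical failure blocks; AI review never runs in this case.
        return {"action": "stop", "failed": failed, "run_ai_review": False}
    docs_only = bool(changed_files) and all(
        path.endswith(DOC_CONFIG_SUFFIXES) for path in changed_files
    )
    # Fast path: skip AI review when only docs/config files changed.
    return {"action": "proceed", "failed": [], "run_ai_review": not docs_only}


print(plan_review({"lint": True, "typecheck": True, "tests": True}, ["README.md"]))
```

Note that AI findings never appear as an input that can flip `action` to `stop`, which mirrors the rule that they are advisory only.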

package/dist/agents/skills/gemini-cli/harness-product-spec/SKILL.md CHANGED

@@ -197,6 +197,16 @@
 - Output format matches existing project conventions when they exist
 - Generated PRD is saved to the correct directory with consistent naming
+## Rationalizations to Reject
+| Rationalization                                                                                    | Why It Is Wrong                                                                                                                                        |
+| -------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "The feature request is clear enough -- I can skip the ambiguity check and start writing stories"  | The gate: no generating specs from ambiguous input without clarification. Missing actors or undefined triggers lead to untestable acceptance criteria. |
+| "This acceptance criterion is understood by the team, so it does not need to be formally testable" | No untestable acceptance criteria is a hard gate. Every criterion must be verifiable by an automated test or specific manual procedure.                |
+| "The happy path scenarios are enough -- edge cases are unlikely"                                   | The skill requires at least one unwanted-behavior criterion for every user-facing action. Edge cases are where production bugs live.                   |
+| "The existing PRD is outdated, so I will just replace it with a fresh one"                         | No overwriting existing specs is a gate. Present the diff rather than replacing the file.                                                              |
+| "We can figure out the success metrics later during implementation"                                | Every success metric must be measurable, time-bound, and specific at spec time.                                                                        |
 ## Examples
 ### Example: GitHub Issue to PRD for Team Notifications

package/dist/agents/skills/gemini-cli/harness-property-test/SKILL.md CHANGED

@@ -266,6 +266,16 @@ def test_sort_handles_floats(xs):
         assert result[i] <= result[i + 1]
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                                | Reality                                                                                                                                                                                                                                                                                                                                |
+| ------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We already have example-based tests that cover the edge cases — property tests would just be redundant."                      | Example-based tests cover the cases the author thought of. Property tests cover the cases they did not. The entire value of generative testing is that it explores regions of the input space that human intuition misses — off-by-one errors, Unicode combining characters, signed integer overflow at boundaries.                    |
+| "The generator keeps producing rejected inputs, so I'll just filter more aggressively to make the test pass faster."           | Heavy `filter` usage is a symptom of a broken generator, not a solution. Each rejected sample wastes an iteration, and `filter` destroys the shrinking chain, leaving you with an unhelpful counterexample when a bug is found. Rewrite the generator using `map` and `flatMap` to construct valid inputs directly.                    |
+| "The counterexample is too strange to be a real-world case — I'll just increase the iteration count so it appears less often." | A shrunk counterexample that triggers a property failure is a real bug by definition. "Unlikely in practice" is not a property of correctness — the question is whether the invariant holds. If the counterexample is a valid input the function might receive, fix the function. If it is not a valid input, constrain the generator. |
+| "This function has too many invariants to specify — I'll just skip property testing and trust the unit tests."                 | Complex functions with many invariants are exactly the functions most in need of property testing. High complexity means a larger bug-hiding surface. Start with the most important invariants (no-crash, round-trip, idempotence) rather than attempting to encode all properties at once.                                            |
+| "Property tests are too slow — they'll block CI for 10 minutes."                                                               | Run 100 iterations on PR, 10,000 iterations nightly. The CI time argument justifies reducing iteration count, never eliminating property tests entirely. A suite that runs 0 property tests found 0 edge cases.                                                                                                                        |
 ## Gates
 - **No property tests without shrinking.** If the framework's automatic shrinking is disabled or the generator uses patterns that break shrinking (excessive `filter`), counterexamples will be unhelpfully large. Fix the generator to support shrinking.
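
The difference between filtering and constructing inputs can be shown without any property-testing framework. The sketch below uses only `random` from the standard library to generate ordered pairs `(lo, hi)`; the function names are hypothetical, and a real framework's `map`/`flatMap` combinators would play the role of the constructive version while also preserving shrinking, which this sketch does not model.

```python
import random


def pairs_by_filter(rng, n, upper=100):
    """Rejection sampling: draw two values and discard invalid pairs."""
    pairs, rejected = [], 0
    while len(pairs) < n:
        lo, hi = rng.randint(0, upper), rng.randint(0, upper)
        if lo <= hi:
            pairs.append((lo, hi))
        else:
            rejected += 1  # a wasted draw
    return pairs, rejected


def pairs_by_construction(rng, n, upper=100):
    """Build valid pairs directly: pick lo, then a non-negative width."""
    pairs = []
    for _ in range(n):
        lo = rng.randint(0, upper)
        width = rng.randint(0, upper - lo)
        pairs.append((lo, lo + width))  # valid by construction
    return pairs


filtered, wasted = pairs_by_filter(random.Random(0), 200)
built = pairs_by_construction(random.Random(0), 200)
print(f"filter wasted {wasted} draws; construction wasted 0")
```

Every pair from the constructive version satisfies `lo <= hi` without discarding anything, which is the behavior the table recommends over heavier filtering.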

package/dist/agents/skills/gemini-cli/harness-refactoring/SKILL.md CHANGED

@@ -134,6 +134,15 @@ Skipping this step means subsequent graph queries (impact analysis, dependency h
 - No behavioral changes were introduced (the test suite is the proof)
 - No dead code was left behind (run `harness cleanup` to verify)
+## Rationalizations to Reject
+| Rationalization                                                                                                     | Reality                                                                                                                                      |
+| ------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The tests are mostly passing, so I can start refactoring and fix the remaining failures as I go"                   | All tests must pass BEFORE refactoring starts. If tests are not green before you start, you are not refactoring -- you are debugging.        |
+| "This refactoring changes a small amount of behavior, but it is a clear improvement"                                | Refactoring must not change behavior. The test suite is the proof. If the refactoring requires changing tests, you may be changing behavior. |
+| "I will make several changes at once and run tests at the end since each change is small"                           | Tests must run after EVERY single change. If a test breaks, you must undo the LAST change immediately.                                       |
+| "The refactoring did not produce a measurable improvement, but the code is different so it must be somewhat better" | If the refactoring introduced no measurable improvement, revert the entire sequence. Refactoring for its own sake is churn.                  |
 ## Examples
 ### Example: Moving business logic out of a UI component

package/dist/agents/skills/gemini-cli/harness-release-readiness/SKILL.md CHANGED

@@ -537,6 +537,17 @@ This framing is informational — it does not block anything. It gives the team
 8. Monorepo support: each package is audited independently with per-package results in the report
 9. `harness validate` passes after the skill's SKILL.md and skill.yaml are written
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                           | Why It Is Wrong                                                                                                            |
+| ----------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
+| "The MAINTAIN phase takes too long, so I will skip dispatching the 4 maintenance agents"  | No skipping the MAINTAIN phase. Maintenance checks catch issues that release-specific checks miss.                         |
+| "This auto-fix is obviously correct, so I can apply it without prompting the user"        | No auto-fix without prompting. Every fix must be presented to the human before being applied.                              |
+| "Most checks pass and only a few warnings remain, so the release is ready"                | A "mostly passing" report is not a passing report. The result is PASS only when zero failures exist across all categories. |
+| "The previous run found these issues and I fixed them, so I can trust the cached results" | Session resumption requires re-running all checks. Code may have changed since the last run.                               |
 ## Examples
 ### Example: First Run on a Monorepo with Gaps

package/dist/agents/skills/gemini-cli/harness-resilience/SKILL.md CHANGED

@@ -240,6 +240,16 @@ Phase 4: VALIDATE
     Redis fallback serves from LRU when Redis is down
 ```
+## Rationalizations to Reject
+| Rationalization                                                                     | Reality                                                                                                                                                                                                                                                                                                                        |
+| ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "That third-party API has 99.99% uptime — we don't need a circuit breaker"          | 99.99% uptime means 52 minutes of downtime per year. That downtime will not occur as one predictable window — it will happen as degraded responses and timeouts during a traffic spike. Without a circuit breaker, every caller blocks for the full timeout duration, exhausting thread pools and cascading across the system. |
+| "We have retry logic, so failures are handled"                                      | Retry logic without a circuit breaker amplifies failures. When the downstream service is degraded, retries multiply the load on an already struggling system. Circuit breakers and retries are complementary controls, not alternatives.                                                                                       |
+| "The fallback adds complexity — we'll add it if the circuit breaker actually opens" | A circuit breaker without a fallback is a different kind of failure mode, not resilience. When the circuit opens, users see an error instead of a degraded-but-functional experience. Fallbacks must be designed and tested before the circuit ever opens in production.                                                       |
+| "Our database connection pool is 100 connections — that's plenty"                   | Connection pool size without query timeouts means slow queries hold connections indefinitely. A single slow query spike can exhaust the pool, causing every subsequent request to wait. Pool sizing and query timeouts are both required.                                                                                      |
+| "The service is internal — it doesn't need rate limiting"                           | Internal services are often called by automated processes, CI pipelines, and batch jobs that can spike traffic in ways user-facing services do not. Missing rate limiting on internal services is a common cause of self-inflicted outages during deployments and data migrations.                                             |
 ## Gates
 - **No retry on non-idempotent operations without idempotency keys.** Retrying a POST or DELETE that lacks an idempotency mechanism can cause data duplication or data loss. This is a blocking finding. The operation must be made idempotent before retry logic is added.
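
To make the circuit-breaker-plus-fallback pairing from the table concrete, here is a minimal sketch. The `CircuitBreaker` class, its thresholds, and the injectable `clock` are all assumptions chosen for illustration; a production breaker would also need thread safety and metrics.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and serves a fallback while open."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        # After the reset timeout the breaker lets one trial call through.
        return self._clock() - self._opened_at < self.reset_timeout_s

    def call(self, operation, fallback):
        if self.is_open:
            # Fail fast: do not block callers on a dependency known to be down.
            return fallback()
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky():
    raise TimeoutError("upstream timed out")


breaker = CircuitBreaker(failure_threshold=2)
for _ in range(3):
    print(breaker.call(flaky, lambda: "cached value"))
```

Callers always receive a degraded-but-functional value, and once the breaker is open the failing dependency is no longer invoked until the reset timeout elapses.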

package/dist/agents/skills/gemini-cli/harness-roadmap/SKILL.md CHANGED

@@ -42,7 +42,7 @@ If the human has not seen and approved the milestone groupings and feature list,
    - Has spec + plan but no implementation -> `planned`
    - Has spec but no plan -> `backlog`
    - Has plan but no spec -> `planned` (unusual, flag for human review)
-6. Detect project name from `harness.yaml` `project` field, or `package.json` `name` field, or directory name as fallback.
+6. Detect project name from `harness.config.json` `project` field, or `package.json` `name` field, or directory name as fallback.
 Present scan summary:
@@ -457,6 +457,15 @@ Choice?
 19. `--query` filters features by status or milestone and displays results with milestone context
 20. `--query` errors gracefully when no roadmap exists, directing the user to `--create`
+## Rationalizations to Reject
+| Rationalization                                                                                                   | Reality                                                                                                                             |
+| ----------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+| "The feature list looks correct, so I can skip the PROPOSE phase and write the roadmap directly"                  | The Iron Law: never write docs/roadmap.md without the human confirming the proposed structure first.                                |
+| "This sync detected a status change and the inference is clearly correct, so I can apply it without confirmation" | The sync PROPOSE phase requires presenting proposed changes and waiting for human confirmation. The human-always-wins rule applies. |
+| "The existing roadmap is outdated, so I will recreate it with --create to get a fresh start"                      | No overwriting an existing roadmap without explicit user consent. Silent overwrites destroy prior manual edits and status tracking. |
+| "There is no roadmap yet but the user asked me to add a feature, so I will create one as a side effect of --add"  | When the roadmap does not exist, --add must error with a clear message directing the user to --create.                              |
 ## Examples
 ### Example: `--create` -- Bootstrap a Roadmap from Existing Artifacts
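
The project-name detection order from step 6 of the first hunk above (`harness.config.json` `project` field, then `package.json` `name` field, then the directory name) can be sketched as follows. The helper names are hypothetical, and treating `project` as a plain string is an assumption about the config schema.

```python
import json
from pathlib import Path


def _read_string_field(path, field):
    """Return a non-empty string field from a JSON file, or None."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: fall through to next source.
        return None
    value = data.get(field) if isinstance(data, dict) else None
    if isinstance(value, str) and value.strip():
        return value
    return None


def detect_project_name(root):
    """Apply the documented fallback chain for the project name."""
    root = Path(root).resolve()
    return (
        _read_string_field(root / "harness.config.json", "project")
        or _read_string_field(root / "package.json", "name")
        or root.name
    )


print(detect_project_name("."))
```

Each source is consulted only when the previous one yields nothing usable, so a malformed config file degrades to the next fallback instead of raising.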

package/dist/agents/skills/gemini-cli/harness-roadmap-pilot/SKILL.md CHANGED

@@ -150,6 +150,14 @@ Proceed with Feature A? (y/n/pick another)
 7. Transition routes to brainstorming (no spec) or autopilot (spec exists)
 8. `harness validate` passes after all changes
+## Rationalizations to Reject
+| Rationalization                                                                                                         | Reality                                                                                                                                 |
+| ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
+| "The top-scored candidate is obviously correct, so I can assign it without asking the human"                            | The Iron Law: never assign or transition without the human confirming the recommendation first.                                         |
+| "Affinity data is not available so the scoring is degraded -- I should just pick the first planned item"                | Proceed without affinity scoring by zeroing out the affinity weight. Position and dependents signals still produce meaningful rankings. |
+| "The feature has no spec, but I can skip brainstorming and jump straight to planning since the summary is clear enough" | A missing spec routes to brainstorming; an existing spec routes to autopilot. A one-line roadmap summary is not a spec.                 |
 ## Examples
 ### Example: Pick Next Item from a Multi-Milestone Roadmap

package/dist/agents/skills/gemini-cli/harness-secrets/SKILL.md CHANGED

@@ -278,6 +278,16 @@ Phase 4: VALIDATE
   Result: FAIL -- rotation required before deployment, history rewrite recommended
 ```
+## Rationalizations to Reject
+| Rationalization                                             | Reality                                                                                                                                                                                                                                              |
+| ----------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "That key is read-only so it's not a big deal if it leaks"  | Read-only credentials still enable data exfiltration, reconnaissance, and discovery of other vulnerabilities. A leaked read-only database credential exposes every row in the database. Scope does not eliminate risk.                               |
+| "We removed it from the file — it's cleaned up now"         | Removing a secret from the current tree does not remove it from git history. Anyone with a clone of the repository can recover the secret with `git log -p`. Rotation is required regardless of file deletion.                                       |
+| "That's a test environment key, not production"             | Test environment credentials are frequently reused, shared informally, and rotated less often. Leaked test keys also reveal credential patterns and naming conventions that help attackers guess production secrets.                                 |
+| "It's in a private repo so only our team can see it"        | Private repos are accessed by CI/CD systems, third-party integrations, contractors, and former employees. Repository access controls are not a substitute for secret externalization. Breaches routinely originate from compromised internal access. |
+| "We'll move it to an environment variable before we deploy" | Intent does not prevent exposure. The secret is in the codebase now and may already be in commit history, CI logs, or developer machine caches. Remediation must happen at the moment of detection, not at deployment time.                          |
 ## Gates
 - **No CRITICAL findings may remain unaddressed.** Production credentials exposed in source code are blocking. Execution halts until the credential is rotated and the code is remediated.
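
Secret externalization, which the table treats as the baseline, might look like the sketch below: the value is read from the process environment at startup and the program refuses to run without it. The helper `require_secret` and the variable name `PAYMENTS_API_KEY` are hypothetical.

```python
import os


def require_secret(name, environ=None):
    """Read a secret from the environment; fail loudly if it is absent."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Never fall back to a hardcoded default: that would put the
        # secret back into source control.
        raise RuntimeError(f"missing required secret: {name}")
    return value


# Usage with an explicit mapping so the example does not depend on the host.
api_key = require_secret("PAYMENTS_API_KEY", {"PAYMENTS_API_KEY": "example-only"})
print(len(api_key))
```

Externalizing the value does not undo an earlier leak; as the table notes, a secret that was ever committed still needs rotation.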

package/dist/agents/skills/gemini-cli/harness-security-review/SKILL.md CHANGED

@@ -174,6 +174,16 @@ Threat Model:
 - **`query_graph` / `get_relationships`** — Used in threat modeling phase for data flow tracing
 - **`get_impact`** — Understand blast radius of security-sensitive changes
+## Rationalizations to Reject
+| Rationalization                                                                        | Reality                                                                                                                                                                                                                                                                                                  |
+| -------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The scanner didn't flag it so it must be fine"                                        | Mechanical scanners catch pattern-level issues. They cannot trace user input across multiple function calls to a dangerous sink, detect authorization logic flaws, or evaluate whether a fallback chain fails open. The AI review phase exists precisely because scanners miss semantic vulnerabilities. |
+| "This endpoint is behind authentication so we don't need to validate input"            | Authentication and input validation are orthogonal controls. Authenticated users can still send malicious payloads. Authenticated SQL injection, SSRF, and path traversal are well-documented attack patterns against internal-only endpoints.                                                           |
+| "The vulnerability requires knowing our internal schema to exploit"                    | Security through obscurity is not a control. Internal schema details leak through error messages, API responses, documentation, and employee turnover. Rate the vulnerability based on its impact assuming the attacker knows the system.                                                                |
+| "We'll add rate limiting and input validation later once the feature ships"            | Security controls added after deployment require re-testing and re-review. Shipping without them creates an exposure window and establishes technical debt that is systematically deprioritized once the feature is live.                                                                                |
+| "That's an OWASP theoretical risk — our app isn't targeted by sophisticated attackers" | OWASP findings are exploited by automated scanners, not just sophisticated attackers. Opportunistic bots continuously probe for SQL injection, XSS, and auth bypass. Unpatched OWASP Top 10 issues are routinely exploited within hours of exposure.                                                     |
 ## Gates
 - **Mechanical scanner must run before AI review.** The scanner catches what patterns can catch; AI reviews what remains.
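
The point that authentication does not replace input validation can be illustrated with a parameterized query. The sketch uses the standard-library `sqlite3` module; the `users` table and the `find_user` helper are hypothetical and stand in for whatever data access an authenticated endpoint performs.

```python
import sqlite3


def find_user(conn, username):
    """Look up a user with a bound parameter instead of string formatting."""
    # The "?" placeholder keeps the input as data; it is never parsed as SQL,
    # so a payload from an authenticated caller cannot alter the query.
    cursor = conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

print(find_user(conn, "alice"))
print(find_user(conn, "' OR '1'='1"))
```

With string concatenation the second lookup would match every row; with the bound parameter it is treated as a literal name and matches none.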

package/dist/agents/skills/gemini-cli/harness-security-scan/SKILL.md CHANGED

@@ -94,21 +94,11 @@ These apply to ALL skills. If you catch yourself doing any of these, STOP.
 ## Rationalizations to Reject
-### Universal
-These reasoning patterns sound plausible but lead to bad outcomes. Reject them.
-- **"It's probably fine"** — "Probably" is not evidence. Verify before asserting.
-- **"This is best practice"** — Best practice in what context? Cite the source and
-  confirm it applies to this codebase.
-- **"We can fix it later"** — If it is worth flagging, it is worth documenting now
-  with a concrete follow-up plan.
-### Domain-Specific
-- **"No attacker would find this"** — Security by obscurity. If the code is wrong, flag it regardless of discoverability.
-- **"We're behind a firewall"** — Network boundaries change. Code should be secure at every layer regardless of deployment topology.
-- **"The framework handles this for us"** — Verify the framework's actual behavior. Misuse of a secure framework is still insecure.
+| Rationalization                     | Reality                                                                                            |
+| ----------------------------------- | -------------------------------------------------------------------------------------------------- |
+| "No attacker would find this"       | Security by obscurity. If the code is wrong, flag it regardless of discoverability.                |
+| "We're behind a firewall"           | Network boundaries change. Code should be secure at every layer regardless of deployment topology. |
+| "The framework handles this for us" | Verify the framework's actual behavior. Misuse of a secure framework is still insecure.            |
 ## Escalation