npm - @harness-engineering/cli - Versions diffs - 1.23.0 → 1.23.2 - Mend

@harness-engineering/cli 1.23.0 → 1.23.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (423)

package/dist/agents/skills/cursor/harness-design-web/SKILL.md CHANGED

@@ -341,6 +341,16 @@ defineProps<Props>();
 </style>
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                                                | Reality                                                                                                                                                                                                                                                |
+| ---------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "The tokens file doesn't exist yet, but I know the brand colors — I'll hardcode them as a placeholder and note they should be replaced later." | Hardcoded values in generated output are the exact problem this skill exists to prevent. There is no placeholder exception. If `design-system/tokens.json` does not exist, instruct the user to run harness-design-system first and stop.              |
+| "The framework is obviously React — everything in this project is React. I don't need to run detection."                                       | Detection also identifies the CSS strategy (Tailwind vs CSS Modules vs CSS-in-JS), which determines how tokens map to code. Skipping detection produces components that may reference non-existent Tailwind classes or wrong theme paths.              |
+| "The user hasn't confirmed the scaffold plan, but the component structure is straightforward — I'll just generate it."                         | The scaffold plan confirmation is a gate. The user must see which tokens will be consumed and what the component structure will be before code is written. Generating first and explaining later inverts the review opportunity.                       |
+| "This component only uses one hardcoded hex value for a shadow — that's not really a design value, so I'll leave it."                          | Every color, font, and spacing value must reference a token. Shadows use color tokens. "Not really a design value" is not a category the verification phase recognizes. The VERIFY phase will flag it; the IMPLEMENT phase should not introduce it.    |
+| "The `@design-token` annotations are just comments — skipping them on a few components won't affect anything."                                 | These annotations are how `harness scan` creates `USES_TOKEN` edges in the knowledge graph. Missing annotations mean harness-impact-analysis cannot trace token changes to affected components. They are structural metadata, not decorative comments. |
 ## Gates
 These are hard stops. Violating any gate means the process has broken down.

package/dist/agents/skills/cursor/harness-diagnostics/SKILL.md CHANGED

@@ -232,6 +232,15 @@ This log accumulates over time and helps improve future classifications.
 - **Flaky test not isolated in 60 minutes:** The non-determinism source may be outside the codebase (infrastructure, external service). Escalate with your findings.
 - **Security vulnerability with large blast radius:** If the minimal fix requires changing more than 3 files, reclassify as Design and escalate.
+## Rationalizations to Reject
+| Rationalization                                                                           | Why It Is Wrong                                                                                                                         |
+| ----------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
+| "I can see the error is a type issue -- let me just fix it without formal classification" | The gate says classification must be explicit and written down before any fix attempt. Implicit classification skips the evidence step. |
+| "This looks like a Design issue, but I can probably fix it locally with a small change"   | Design category MUST escalate. Local fixes for architectural problems create more architectural problems.                               |
+| "I do not need to run tests before fixing -- I know what the baseline is"                 | Deterministic checks before AND after is a gate. Without a recorded baseline, you cannot prove the fix helped.                          |
+| "My first fix did not work, but I will try a different approach within the same category" | Reclassify, do not force. If the resolution strategy is not working, the classification is probably wrong.                              |
 ## Examples
 ### Example 1: Type Error in API Handler

package/dist/agents/skills/cursor/harness-docs-pipeline/SKILL.md CHANGED

@@ -379,6 +379,17 @@ while iteration < maxIterations:
 - PASS/WARN/FAIL report includes per-category breakdown and specific remaining findings
 - Drift fixes in FIX phase are excluded from AUDIT findings (no double-counting)
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                                               | Why It Is Wrong                                                                                                                                   |
+| ------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The drift finding is marked unsafe but the fix is obvious, so I will apply it silently"                      | Never apply a fix classified as unsafe without explicit user approval. The Iron Law: safe fixes are silent, unsafe fixes surface.                 |
+| "The convergence loop reduced findings from 8 to 6, but the remaining ones are hard -- I will keep iterating" | If a convergence iteration does not reduce the finding count, stop immediately. Continuing without progress wastes iterations.                    |
+| "I can write the drift detection logic directly instead of delegating to detect-doc-drift"                    | The pipeline delegates, never reimplements. Each sub-skill retains full standalone functionality.                                                 |
+| "The graph is not available so the pipeline results will be unreliable"                                       | The entire pipeline runs without a graph using static analysis fallbacks. Reduced accuracy is noted in the report, not used as an excuse to skip. |
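The stopping rule in the second row of the harness-docs-pipeline table (stop as soon as an iteration fails to reduce the finding count) can be sketched as below. This is a minimal illustration, not the pipeline's actual code: `converge`, `runIteration`, and `ConvergenceResult` are hypothetical names, and the finding counts are simulated.

```typescript
// Hypothetical sketch of the convergence stopping rule; not the pipeline's real code.
type IterationFn = (iteration: number) => number; // returns the remaining finding count

interface ConvergenceResult {
  iterations: number;    // how many iterations actually ran
  remaining: number;     // finding count when the loop stopped
  stoppedEarly: boolean; // true when an iteration made no progress
}

function converge(
  initialFindings: number,
  runIteration: IterationFn,
  maxIterations: number,
): ConvergenceResult {
  let remaining = initialFindings;
  let iteration = 0;
  while (iteration < maxIterations && remaining > 0) {
    const next = runIteration(iteration);
    iteration += 1;
    if (next >= remaining) {
      // No reduction in findings: stop immediately instead of spending more iterations.
      return { iterations: iteration, remaining: next, stoppedEarly: true };
    }
    remaining = next;
  }
  return { iterations: iteration, remaining, stoppedEarly: false };
}

// Simulated run: 8 findings drop to 6, then the next pass stalls at 6.
const simulatedCounts = [6, 6, 4];
const stalled = converge(8, (i) => simulatedCounts[i], 5);
console.log(stalled);
```

In this simulated run the loop never reaches the third pass, because the second pass made no progress.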
 ## Examples
 ### Example: Full pipeline run with fixes

package/dist/agents/skills/cursor/harness-dx/SKILL.md CHANGED

@@ -261,6 +261,16 @@ Phase 4: VALIDATE
   DX Scorecard: B -> A (projected after applying changes)
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                                                        | Reality                                                                                                                                                                                                                                                                    |
+| ------------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The README has an installation section but it only covers npm — yarn and pnpm users can figure it out. I'll mark installation as complete."           | Installation instructions must cover all package managers the project supports. If `yarn.lock` or `pnpm-lock.yaml` exists alongside `package-lock.json`, all three installers must be documented. Partial coverage is scored as partial, not complete.                     |
+| "This code example in the README uses the old `sdk.connect()` API — but it still parses syntactically, so it passes the syntax check."                 | Stale API references are broken examples regardless of syntax validity. A syntactically valid example that calls a renamed or removed function fails the freshness check and must be flagged as broken in the scorecard.                                                   |
+| "The API function's behavior is complex, but I can infer what it does from the name `parseAndValidate` — I'll write the docstring stub based on that." | Documentation must be derived from actual source code: type signatures, test files, and existing docs. Inferring behavior from function names produces fabricated documentation. Flag functions that cannot be documented from source as requiring developer-written docs. |
+| "The getting-started guide already exists in the wiki — it's not in the repo, but I'll mark the quickstart as present."                                | Documentation must be locatable from the repository root. A wiki link from the README satisfies the API reference link criterion only if the link is explicit. A guide that requires knowing where the wiki is does not meet the discoverability requirement.              |
+| "There are 18 undocumented exports — I'll generate all 18 JSDoc stubs and commit them without showing the user first."                                 | Scaffolded documentation must be presented for review before being written. Generated stubs may contain inaccurate parameter descriptions or wrong return type assumptions. Use `emit_interaction` to present scaffolded content and wait for approval.                    |
 ## Gates
 - **No scaffolding without human confirmation.** Generated documentation is always presented as a draft for review. Do not commit generated files automatically. Use `emit_interaction` to present scaffolded content and wait for approval.

package/dist/agents/skills/cursor/harness-e2e/SKILL.md CHANGED

@@ -230,6 +230,15 @@ describe('Checkout flow', () => {
 });
 ```
+## Rationalizations to Reject
+| Rationalization                                                                      | Why It Is Wrong                                                                                                                                                    |
+| ------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "Using CSS class selectors is faster than adding data-testid attributes"             | No CSS class selectors in page objects. .btn-primary breaks when the design system updates class names. Use data-testid, ARIA roles, and accessible labels.        |
+| "Adding a short waitForTimeout is easier than figuring out the right wait condition" | No arbitrary waits is a hard gate. waitForTimeout is a flakiness timebomb. Wait for specific conditions: network responses, DOM mutations, or URL changes.         |
+| "This test creates data through the UI because the API setup is complex"             | Test data must be created via API or fixtures, not through UI interactions. UI-based setup is slow, brittle, and conflates setup failures with assertion failures. |
+| "The test only fails sometimes in CI -- adding a retry will fix it"                  | Flaky tests block merge. Diagnose the root cause. Retries mask problems. After remediation, rerun 5 times to confirm stability.                                    |
 ## Gates
 - **No CSS class selectors in page objects.** If a locator uses `.btn-primary` or `[class*="header"]`, the test is brittle. Use `data-testid`, ARIA roles, or accessible labels. Rewrite before merging.

package/dist/agents/skills/cursor/harness-event-driven/SKILL.md CHANGED

@@ -265,6 +265,16 @@ PASS: Idempotency via sequence numbers on event store
 PASS: Read model rebuild procedure documented in ops runbook
 ```
+## Rationalizations to Reject
+| Rationalization                                                                               | Reality                                                                                                                                                                                                                                                                                                                                                    |
+| --------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "Our handlers are idempotent enough — we don't need a deduplication table"                    | "Idempotent enough" is not a guarantee. At-least-once delivery means the same message can arrive seconds, minutes, or hours apart. A handler that relies on approximate idempotency (e.g., checking a cache) will produce duplicate side effects when the deduplication window expires or the cache is flushed.                                            |
+| "We publish the event right after the database write — it's essentially the same transaction" | Two separate operations are not a transaction regardless of how close together they are. If the process crashes between the database write and the event publish, the write is committed but the event is never sent. Consumers will never see the state change. This is the dual-write problem and it requires the transactional outbox pattern to solve. |
+| "The dead-letter queue is configured but nobody monitors it"                                  | An unmonitored DLQ is a silent data loss queue. Failed messages accumulate with no alerting, no replay procedure, and no investigation. A DLQ without monitoring and a replay runbook is a place where business events go to die.                                                                                                                          |
+| "Saga compensation is complex — we'll handle failures with manual intervention"               | Manual intervention does not scale and is not available at 3am. A saga that partially completes without compensation leaves the system in a state that requires a human to reconstruct — which means it will not be reconstructed reliably. Every saga step that can fail must have a defined compensating action.                                         |
+| "We'll add event versioning when we need to change the schema"                                | Adding versioning to an event schema after consumers are deployed is a breaking change. Consumers expecting version 1 receive an unversioned event and have no way to detect that it is incompatible. Versioning must be in the envelope from the first event in production.                                                                               |
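The first and last rows of the harness-event-driven table (explicit deduplication, and a version field in the envelope from the first event) can be illustrated with a small sketch. Everything here is an assumption made for illustration: `EventEnvelope` and `DedupingConsumer` are hypothetical names, and the in-memory `Set` only stands in for a durable deduplication table, which in a real system would be written in the same transaction as the handler's side effects.

```typescript
// Hypothetical sketch: an in-memory Set stands in for a durable deduplication table.
interface EventEnvelope<T> {
  id: string;      // unique message id used as the deduplication key
  version: number; // schema version carried in the envelope from the first event
  payload: T;
}

class DedupingConsumer<T> {
  private readonly processed = new Set<string>();
  private readonly handler: (payload: T) => void;

  constructor(handler: (payload: T) => void) {
    this.handler = handler;
  }

  // Returns true if the handler ran, false if the message was a duplicate.
  handle(event: EventEnvelope<T>): boolean {
    if (this.processed.has(event.id)) {
      return false; // redelivery under at-least-once semantics: skip side effects
    }
    this.handler(event.payload);
    this.processed.add(event.id); // record the id only after the handler succeeds
    return true;
  }
}

let chargedTotal = 0;
const consumer = new DedupingConsumer<{ amount: number }>((payload) => {
  chargedTotal += payload.amount;
});

const paymentEvent: EventEnvelope<{ amount: number }> = {
  id: "evt-1",
  version: 1,
  payload: { amount: 25 },
};
const firstDelivery = consumer.handle(paymentEvent);
const redelivery = consumer.handle(paymentEvent); // simulated duplicate delivery
console.log(firstDelivery, redelivery, chargedTotal);
```

Unlike a cache-based check, the record of processed ids here never expires, which is the property the table asks for. A process restart would still lose this in-memory set, so the sketch shows the logic only, not the durability.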
 ## Gates
 - **Every consumer must have a dead-letter queue.** No consumer may silently drop failed messages. WHERE a consumer is configured without a DLQ, THEN the skill must halt and require DLQ configuration before proceeding. Lost messages in production are unrecoverable.

package/dist/agents/skills/cursor/harness-execution/SKILL.md CHANGED

@@ -411,6 +411,15 @@ When this skill makes claims about task completion, test results, or code behavi
 - No improvisation: tasks were executed as written, or execution was stopped and the blocker was reported
 - All stopping conditions were respected (no guessing past blockers, no blind retries)
+## Rationalizations to Reject
+| Rationalization                                                                                                | Reality                                                                                                                                                   |
+| -------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The plan says to do X, but doing Y would be cleaner -- I will improvise"                                      | The Iron Law states: execute the plan as written. If the plan is wrong, stop and fix the plan. Improvising mid-execution introduces untested assumptions. |
+| "This task depends on Task 3 which I know is done, so I can skip verifying prerequisites"                      | Prerequisites must be verified mechanically, not from memory. Check that dependency tasks are marked complete in state and that referenced files exist.   |
+| "The checkpoint is just a confirmation step and the output looks correct, so I will auto-continue"             | Checkpoints are non-negotiable pause points. If a task has a checkpoint marker, execution must pause.                                                     |
+| "Harness validate passed on the previous task and nothing changed structurally, so I can skip it for this one" | Validation runs after every task with no exceptions. Each task may introduce subtle architectural drift that only harness validate catches.               |
 ## Examples
 ### Example: Executing a 5-Task Notification Plan

package/dist/agents/skills/cursor/harness-feature-flags/SKILL.md CHANGED

@@ -206,6 +206,16 @@
 - Rollout configuration is validated for active flags
 - Lifecycle policies are recommended with enforcement mechanisms
+## Rationalizations to Reject
+| Rationalization                                                            | Why It Is Wrong                                                                                                                        |
+| -------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
+| "This release flag has been at 100% for a while, but removing it is risky" | Release flags at 100% for more than 30 days are stale candidates. Every stale flag adds dead code branches and test matrix complexity. |
+| "We only need to test the flag-on path since that is the path we ship"     | No flags without test coverage for both paths. The flag-off path IS the fallback when the flag provider is unreachable.                |
+| "These two flags depend on each other, but they work fine together"        | No coupled flag dependencies is a blocking finding. Flags that require other flags create combinatorial complexity.                    |
+| "Setting the flag default to true makes the rollout easier"                | Every flag must default to safe (feature disabled). A default of true means a provider outage enables the feature for everyone.        |
+| "We do not need a naming convention -- our flag count is small"            | Inconsistent naming becomes unmanageable as flag count grows. The skill flags inconsistency as a warning even at small scale.          |
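The "default to safe" and "test both paths" rows of the harness-feature-flags table can be sketched together. This is an illustrative sketch only: `FlagProvider`, `evaluateFlag`, and the `new-checkout` flag name are hypothetical and do not correspond to any real flag SDK.

```typescript
// Hypothetical sketch; FlagProvider is an illustrative interface, not a real SDK.
interface FlagProvider {
  isEnabled(flag: string): boolean; // may throw when the provider is unreachable
}

const SAFE_DEFAULT = false; // safe default: feature disabled

function evaluateFlag(provider: FlagProvider, flag: string): boolean {
  try {
    return provider.isEnabled(flag);
  } catch {
    // Provider outage: fall back to the flag-off path instead of enabling the feature.
    return SAFE_DEFAULT;
  }
}

function checkoutVariant(provider: FlagProvider): string {
  return evaluateFlag(provider, "new-checkout") ? "new" : "legacy";
}

const healthyProvider: FlagProvider = {
  isEnabled: (flag) => flag === "new-checkout",
};
const unreachableProvider: FlagProvider = {
  isEnabled: () => {
    throw new Error("provider unreachable");
  },
};

console.log(checkoutVariant(healthyProvider), checkoutVariant(unreachableProvider));
```

Exercising both providers covers the flag-on path and the flag-off fallback, which is the pair of paths the table requires tests for.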
 ## Examples
 ### Example: React SPA with LaunchDarkly

package/dist/agents/skills/cursor/harness-git-workflow/SKILL.md CHANGED

@@ -190,6 +190,16 @@ git branch -D <branch-name>
 - Worktree was cleaned up after finishing (unless keeping for continued work)
 - No stale worktree references remain after cleanup
+## Rationalizations to Reject
+| Rationalization                                                                                                                                       | Reality                                                                                                                                                                                                                                                        |
+| ----------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The tests are probably fine on the fresh branch — they were passing on main when I last checked. I'll skip baseline verification and start working." | Baseline verification is the condition that makes branch work trustworthy. A test failure discovered at finish time is ambiguous — it could be pre-existing or introduced by the work. Skipping baseline removes the only clean comparison point.              |
+| "The user said 'just merge it' — I'll merge without checking if the base branch has advanced since the worktree was created."                         | The pre-finish check for base branch divergence is mandatory before any finishing strategy. Merging without rebasing first can produce a merge that silently breaks tests that were passing on the branch but conflict with new commits on main.               |
+| "The worktree directory isn't gitignored, but it's inside a nested folder that's unlikely to be committed accidentally."                              | The `.gitignore` check is not about likelihood — it is about preventing accidental commits of worktree state that would corrupt the repository. If the worktree directory is not gitignored, add it before creating the worktree. No exceptions.               |
+| "The user chose to discard — I'll delete the branch and worktree immediately without showing the commits that will be lost."                          | The discard path requires showing the commit list from `git log main..HEAD --oneline` and receiving explicit confirmation before running `git worktree remove` and `git branch -D`. Work is being permanently deleted; the user must see what they are losing. |
+| "There's already a worktree for this branch at a different path — I'll create a second one since the user asked for a fresh setup."                   | Git does not allow two worktrees checked out to the same branch. Attempting to create a duplicate will fail. Instead, ask the user whether to use the existing worktree or create a new branch. Never assume a second worktree is the right answer.            |
 ## Examples
 ### Example: Setting Up a Worktree for a New Feature

package/dist/agents/skills/cursor/harness-hotspot-detector/SKILL.md CHANGED

@@ -128,6 +128,16 @@ Use `get_relationships` to check structural edges between co-change pairs.
 - Report follows the structured output format
 - All findings are backed by graph query evidence (with graph) or git log analysis (without graph)
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                                                          | Why It Is Wrong                                                                                                                                         |
+| ------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "High churn just means the file is actively developed, not that it is risky"                                             | High churn in shared utilities specifically equals high risk. A file with 45 commits that co-changes with 12 different files indicates hidden coupling. |
+| "The co-change pair is between two files in different modules, but they probably just happen to change at the same time" | Distant co-change pairs are flagged as suspicious precisely because they indicate hidden coupling.                                                      |
+| "No graph exists so the analysis will be too incomplete to be useful"                                                    | Git log provides ~90% of the data needed for hotspot detection. The fallback is the highest-completeness fallback across all graph-enhanced skills.     |
 ## Examples
 ### Example: Detecting Hotspots in a Growing Codebase

package/dist/agents/skills/cursor/harness-i18n/SKILL.md CHANGED

@@ -465,6 +465,16 @@ Remaining violations (require human judgment): 5
 - I18N-401: Missing key in es -- requires Spanish translation
 - I18N-402: Untranslated value in fr -- requires French translation
+## Rationalizations to Reject
+| Rationalization                                                                                                                           | Reality                                                                                                                                                                                                                                               |
+| ----------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "This string is the app's brand name — it's technically hardcoded but obviously shouldn't be translated. I'll skip flagging it."          | Brand names require explicit suppression via `// i18n-ignore` comment, not silent omission from the scan. Skipping without suppression means future scans have inconsistent results and the team has no record of the deliberate decision.            |
+| "The framework isn't in the knowledge base, but I can tell from context it's using i18next patterns — I'll apply i18next rules directly." | Unrecognized frameworks must fall back to generic detection rules, not assumed framework rules. Applying i18next-specific fix patterns to an unknown framework produces incorrect wrapping that breaks at runtime. Log the gap and use generic rules. |
+| "The project has `i18n.enabled: false` — I'll still flag errors for hardcoded strings since the team should know about them."             | Respecting `i18n.enabled: false` is a gate. The team made a configuration decision. In that state, run in discovery mode (info severity only). Escalating to errors overrides the team's explicit choice.                                             |
+| "I18N-402 untranslated values are just warnings — I'll skip reporting them to keep the report shorter."                                   | Untranslated values (target identical to source) are a distinct violation category with their own code. They indicate copy-paste during file creation without actual translation. Omitting them produces a misleadingly optimistic coverage report.   |
+| "The plural rules for this locale look complex — I'll just check for 'one' and 'other' forms like English and move on."                   | Plural rules are locale-specific and must be loaded from the locale profile. Arabic requires six categories; Polish requires four. Checking only English plural categories produces false-passing results for languages that require more forms.      |
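The plural-rules row of the harness-i18n table can be checked against the standard `Intl.PluralRules` API. The sketch below is only an illustration of why English-only checks pass falsely: the skill itself is described as loading rules from a locale profile, not from `Intl`, and `missingPluralForms` is a hypothetical helper. The categories returned depend on the runtime's ICU data.

```typescript
// Sketch using the standard Intl.PluralRules API; results depend on the runtime's ICU data.
function pluralCategories(locale: string): string[] {
  return new Intl.PluralRules(locale).resolvedOptions().pluralCategories;
}

// Hypothetical helper: lists the categories a locale needs that a translation entry lacks.
function missingPluralForms(locale: string, providedForms: string[]): string[] {
  const provided = new Set(providedForms);
  return pluralCategories(locale).filter((category) => !provided.has(category));
}

// A translation entry that only supplies the two English cardinal forms.
const englishOnlyForms = ["one", "other"];

console.log("en missing:", missingPluralForms("en", englishOnlyForms));
console.log("pl missing:", missingPluralForms("pl", englishOnlyForms));
console.log("ar missing:", missingPluralForms("ar", englishOnlyForms));
```

With full ICU data, an entry that is complete for English should report missing forms for both Polish and Arabic.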
 ## Gates
 These are hard stops. Violating any gate means the process has broken down.

package/dist/agents/skills/cursor/harness-i18n-process/SKILL.md CHANGED

@@ -369,6 +369,16 @@ Result:         BLOCKED -- i18n review not conducted for user-facing PR
 Action:         Run harness-i18n scan on changed files, address findings, then re-review
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                                                     | Reality                                                                                                                                                                                                                                                                   |
+| --------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "I can see hardcoded strings in the component the user is discussing — I'll flag them now as part of the process review."                           | This skill operates on artifacts (specs, plans, review context), never on source code. Scanning source files is harness-i18n's responsibility. Running Grep on component files from within this skill violates the skill boundary, regardless of how convenient it seems. |
+| "The feature clearly has no user-facing strings — it's a background job. I'll skip injection entirely without checking."                            | The skill must assess whether injection is applicable before skipping. Background jobs can produce user-facing output via notifications, emails, and error responses. A deliberate "not applicable" decision requires reading the feature description, not assuming.      |
+| "The team is in prompt mode and has dismissed the suggestion twice — I'll escalate to gate mode enforcement to make sure they take i18n seriously." | Escalating to gate mode is a configuration decision the team must make explicitly. Prompt mode is always dismissible. Unilaterally enforcing gate-mode behavior overrides a team's deliberate choice and violates the skill's core operating contract.                    |
+| "The plan has one task called 'polish and cleanup' — that probably includes i18n work. I'll mark the i18n check as passing."                        | In gate mode, i18n task presence must be verified by keyword match (i18n, translation, locale, localization, l10n), not inferred from vague task names. Ambiguous tasks must be flagged as missing, not assumed to cover the requirement.                                 |
+| "The spec mentions 'multi-language support' in passing — that counts as addressing i18n, so I won't require a dedicated section."                   | A passing mention is not an i18n section. The validation check requires the spec to identify which strings are user-facing, which locales are affected, and any formatting requirements. A vague reference satisfies none of these.                                       |
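The gate-mode keyword check in the fourth row of the harness-i18n-process table can be sketched as follows. The keyword list comes from the table; the function name `hasI18nTask` and the case-insensitive substring matching are assumptions made for illustration, not the skill's documented matching rule.

```typescript
// Hypothetical sketch of the gate-mode keyword check; matching strategy is an assumption.
const I18N_KEYWORDS = ["i18n", "translation", "locale", "localization", "l10n"];

function hasI18nTask(taskNames: string[]): boolean {
  return taskNames.some((name) => {
    const lower = name.toLowerCase();
    return I18N_KEYWORDS.some((keyword) => lower.includes(keyword));
  });
}

// A vague task name does not satisfy the check; an explicit one does.
const vaguePlan = ["Build settings page", "Polish and cleanup"];
const explicitPlan = ["Build settings page", "Add Spanish translation files"];

console.log(hasI18nTask(vaguePlan), hasI18nTask(explicitPlan));
```

Under this sketch, a plan whose only candidate is "polish and cleanup" is reported as missing an i18n task, never assumed to cover it.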
 ## Gates
 These are hard stops. Violating any gate means the process has broken down.
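The keyword-match rule in the table above (i18n, translation, locale, localization, l10n) can be sketched as follows. This is a minimal illustration, not the skill's actual implementation; the function name `has_i18n_task` and the choice of a case-insensitive prefix match are assumptions.

```python
import re

# Keywords taken from the table row above.
I18N_KEYWORDS = ("i18n", "translation", "locale", "localization", "l10n")

# Assumption: a keyword counts when it starts a word, case-insensitively,
# so "translations" and "L10N" both match.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in I18N_KEYWORDS) + r")",
    re.IGNORECASE,
)


def has_i18n_task(task_names):
    """Return True only if some task name contains an explicit i18n keyword."""
    return any(_PATTERN.search(name) for name in task_names)
```

Under this sketch a plan whose only candidate is "polish and cleanup" is reported as missing an i18n task, rather than assumed to cover it.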

package/dist/agents/skills/cursor/harness-i18n-workflow/SKILL.md CHANGED

@@ -493,6 +493,16 @@ emails.welcome.greeting           -> "Hello {name}, welcome aboard!"
 Approve to continue scaffolding, or provide corrections.
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                                                  | Reality                                                                                                                                                                                                                                                                |
+| ------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The user already told me they want Spanish and French — I can skip the configuration phase and go straight to scaffolding."                     | The configuration phase writes the `i18n` block to `harness.config.json`. Without it, subsequent runs of harness-i18n have no enabled flag, no strictness level, and no locale list to work against. Verbal confirmation does not substitute for written config.       |
+| "In retrofit mode, the key naming is straightforward — I'll apply the generated key catalog directly without showing it to the user for review." | The retrofit key catalog checkpoint is a hard gate. Key names become permanent identifiers that translation teams, TMS tools, and source code will reference for years. The user must review and approve them before any files are written.                            |
+| "The pseudo-locale transformation for this string with `{name}` is obvious — I'll just wrap the entire string including the placeholder."        | ICU MessageFormat placeholders must be preserved exactly. Transforming `{name}` to `{ñàmë}` breaks the interpolation at runtime. The pseudo-locale algorithm must detect and skip all placeholder syntax before applying accent and expansion transforms.              |
+| "These target locale files already exist from a previous run — I'll overwrite them with the new extraction output to keep things clean."         | Existing target locale translations must never be overwritten. A key with a translated (non-empty, non-source-identical) value in a target locale represents real translation work. Overwriting it destroys that work silently. Always preserve existing translations. |
+| "We found 120 strings in retrofit mode — I'll just run the full extraction without the audit phase since we clearly need everything extracted."  | The retrofit audit results are what tell the user how much effort the extraction requires and let them prioritize high-traffic flows. Skipping the audit and going straight to extraction removes the user's ability to scope the work before it happens.              |
 ## Gates
 These are hard stops. Violating any gate means the process has broken down.
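The placeholder rule in the table above (skip `{name}` before applying accent transforms) can be sketched like this. It is a simplified illustration under stated assumptions: placeholders are simple, non-nested `{...}` tokens (nested ICU plural/select syntax is not handled), the accent map is a small hypothetical subset, and the surrounding brackets are a hypothetical stand-in for the expansion step.

```python
import re

# Assumption: placeholders are simple, non-nested `{...}` tokens.
PLACEHOLDER = re.compile(r"\{[^{}]*\}")

# Hypothetical accent map; a real pseudo-locale would cover more characters.
ACCENTS = str.maketrans(
    {"a": "à", "e": "ë", "i": "ï", "o": "ö", "u": "ü", "n": "ñ"}
)


def pseudo_localize(text):
    """Accent the literal text but copy every placeholder through unchanged."""
    parts = []
    last = 0
    for match in PLACEHOLDER.finditer(text):
        # Transform only the literal text before the placeholder.
        parts.append(text[last:match.start()].translate(ACCENTS))
        # Keep the placeholder byte-for-byte so interpolation still works.
        parts.append(match.group(0))
        last = match.end()
    parts.append(text[last:].translate(ACCENTS))
    return "[" + "".join(parts) + "]"
```

With the string from the hunk header, `pseudo_localize("Hello {name}, welcome aboard!")` yields `[Hëllö {name}, wëlcömë àböàrd!]`: the literal text is accented while `{name}` survives intact.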

package/dist/agents/skills/cursor/harness-impact-analysis/SKILL.md CHANGED

@@ -151,6 +151,16 @@ When no graph is available, use static analysis to approximate impact:
 - Report follows the structured output format
 - All findings are backed by graph query evidence (with graph) or systematic static analysis (without graph)
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                                    | Why It Is Wrong                                                                                                                          |
+| -------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
+| "The change is small so the blast radius must be low -- I can skip the transitive dependent check" | Small changes to shared utilities can have outsized blast radius. A one-line change to auth.ts can affect 23 transitive dependents.      |
+| "The graph is a few commits behind but it is close enough for this analysis"                       | If the graph is more than 2 commits behind, the skill requires a refresh before proceeding. Recent commits may have added new consumers. |
+| "No graph exists so I cannot produce a useful impact analysis"                                     | The fallback strategy using import parsing and naming conventions achieves ~70% completeness. Missing the graph does not mean stopping.  |
 ## Examples
 ### Example: Analyzing a Change to auth.ts
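The transitive dependent check mentioned in the table above can be approximated with a breadth-first walk over a reversed import map. This is a sketch only: the `imports` dictionary shape and the function name are assumptions, and the skill's real graph queries are not shown in this diff.

```python
from collections import deque


def transitive_dependents(imports, changed):
    """Return every module that directly or indirectly imports `changed`.

    `imports` maps a module name to the set of modules it imports
    (a hypothetical shape chosen for this sketch).
    """
    # Invert the map: module -> modules that import it.
    reverse = {}
    for module, deps in imports.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(module)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in reverse.get(current, ()):
            # Skip modules already visited and the changed module itself,
            # which also keeps import cycles from looping forever.
            if dependent not in seen and dependent != changed:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

Even a one-line change reaches every module returned here, which is why the size of the edit says little about the blast radius.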

package/dist/agents/skills/cursor/harness-incident-response/SKILL.md CHANGED

@@ -208,6 +208,16 @@ Phase 4: IMPROVE
     4. [P2] Create secret rotation runbook for all services (owner: @sre)
 ```
+## Rationalizations to Reject
+| Rationalization                                                                         | Reality                                                                                                                                                                                                                                                                           |
+| --------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The root cause was human error — someone pushed a bad config"                          | Human error is a symptom, not a root cause. The root cause is the system that allowed a bad config to reach production undetected. A postmortem that stops at "human error" prevents no future incidents because it identifies no systemic fix.                                   |
+| "We know what happened — we don't need to write a full postmortem for a minor incident" | The decision about what is "minor" is made under the stress of recovery, not under calm analysis. Contributing factors and near-misses that look minor in the moment are frequently the root cause of the next major incident. Document while the context is fresh.               |
+| "The action items are in Slack — we don't need to track them formally"                  | Action items not tracked in a formal system with owners and due dates are not completed. Slack messages are buried within hours. The improvement phase of an incident exists only if its outputs are tracked to completion.                                                       |
+| "We don't have SLOs yet so we can't calculate error budget impact"                      | The absence of SLOs is itself a finding. Without SLOs, there is no objective basis for deciding whether reliability is acceptable. The incident is the forcing function to establish baseline SLOs. Document this gap as a P0 action item.                                        |
+| "The incident was caused by a third-party outage — nothing we could have done"          | Third-party outages expose missing circuit breakers, absent fallbacks, and insufficient multi-region routing. The postmortem should document why the third-party outage caused a customer-visible incident and what resilience improvements would have isolated the blast radius. |
 ## Gates
 - **No postmortem without a root cause statement.** A postmortem that says "cause unknown" is incomplete. If the root cause cannot be determined, the postmortem must document what was investigated, what was ruled out, and what additional data is needed. Do not close the investigation.

package/dist/agents/skills/cursor/harness-infrastructure-as-code/SKILL.md CHANGED

@@ -264,6 +264,16 @@ Phase 4: VALIDATE
   Result: WARN -- 2 security improvements needed
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                | Reality                                                                                                                                                                                                                                                                                                  |
+| ---------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We store state locally because it's just a dev environment"                                   | Local state is not shared between team members. Two developers running `terraform apply` against the same environment with diverged local state will produce conflicting resource definitions, duplicate resources, or state corruption that requires manual recovery.                                   |
+| "We haven't pinned the provider version because we want to automatically get security patches" | Unpinned providers can silently change resource behavior on `terraform init`. A `>= 5.0` constraint without an upper bound can pull a provider with breaking changes. Pin the minor version and upgrade explicitly via reviewed PRs so changes are intentional.                                          |
+| "That S3 bucket has public access because it hosts our static site"                            | Static site hosting does not require a public bucket ACL. CloudFront with an Origin Access Control (OAC) policy serves files from a private bucket. Public bucket ACLs are a common misconfiguration vector because they apply to all objects, including accidentally uploaded sensitive files.          |
+| "We'll tag resources properly before we go to production"                                      | Untagged resources accumulate. Cost allocation reports become impossible, security audits cannot identify owners, and decommissioning requires manual investigation of every resource. Tagging must be enforced at resource creation — retroactive tagging at scale is a weeks-long engineering project. |
+| "Manual changes are fine for urgent hotfixes — we'll import them to Terraform afterward"       | Manual changes without immediate import create drift that may be overwritten by the next `terraform apply`. The "import it later" step is almost never done. Every manual change that goes unimported erodes the reliability guarantee that IaC provides.                                                |
 ## Gates
 - **No local state for shared infrastructure.** Terraform configurations managing shared resources must use a remote backend with locking. Local state is blocking for any non-experimental configuration.

package/dist/agents/skills/cursor/harness-integration-test/SKILL.md CHANGED

@@ -256,6 +256,15 @@ describe('ProjectService contract', () => {
 });
 ```
+## Rationalizations to Reject
+| Rationalization                                                                           | Why It Is Wrong                                                                                                                                             |
+| ----------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "Testing the happy path is sufficient -- error scenarios are edge cases"                  | The success criteria require error scenarios (400, 401, 403, 404, 500, timeout) for all public endpoints. Error paths are where real-world failures happen. |
+| "We can test against the staging environment instead of setting up local mocks"           | No integration tests that require external staging environments for CI. Tests must run with local test doubles.                                             |
+| "The consumer contract changed, so I will update the consumer test to match the provider" | Contract changes must be coordinated. The provider may have introduced a bug, not an intentional change.                                                    |
+| "Tests pass when I run them in order, so they are fine"                                   | Phase 4 requires running tests in random order. Any test that fails only in a specific order has a shared-state bug.                                        |
 ## Gates
 - **No integration tests that require external staging environments for CI.** Every integration test must run with local test doubles (mocks, containers, in-memory databases). Tests that fail without a staging VPN are not integration tests -- they are environment tests.

package/dist/agents/skills/cursor/harness-integrity/SKILL.md CHANGED

@@ -122,6 +122,16 @@ Rules:
 - [ ] Unified report follows the exact format
 - [ ] Overall verdict correctly reflects both mechanical and review results
+## Rationalizations to Reject
+These are common rationalizations that sound reasonable but lead to incorrect results. When you catch yourself thinking any of these, stop and follow the documented process instead.
+| Rationalization                                                                                                | Why It Is Wrong                                                                                                                                    |
+| -------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "All three mechanical checks failed, but I should still run the AI review to get useful feedback"              | When ALL three checks fail, stop immediately. Do not proceed to Phase 2. AI review on code that does not compile is wasted effort.                 |
+| "The security scanner found a warning but it is not high severity, so it should not affect the overall result" | Error-severity security findings are blocking. The distinction is severity, not the agent's opinion of importance.                                 |
+| "The AI review flagged an architectural concern as blocking, so the integrity check should fail"               | Only runtime errors, data loss, and security vulnerabilities count as blocking review findings. Architectural concerns are noted but do not block. |
 ## Examples
 ### Example: All Clear

package/dist/agents/skills/cursor/harness-knowledge-mapper/SKILL.md CHANGED

@@ -162,6 +162,15 @@ This ensures subsequent graph queries (impact analysis, drift detection) include
 - Report follows the structured output format
 - All findings are backed by graph query evidence (with graph) or directory/file analysis (without graph)
+## Rationalizations to Reject
+| Rationalization                                                                         | Why It Is Wrong                                                                                                                                                   |
+| --------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "The graph is a few commits behind, but it is close enough for knowledge mapping"       | If the graph is more than 10 commits behind, run harness scan before proceeding. A stale graph produces a knowledge map with missing modules.                     |
+| "No graph exists, so this skill cannot produce useful output"                           | The fallback strategy is explicit: use directory structure and file analysis. Fallback completeness is ~50%, significantly better than nothing.                   |
+| "The existing AGENTS.md is outdated, so I will overwrite it with the generated version" | Never overwrite without confirmation. Existing AGENTS.md may contain carefully authored context the graph cannot infer.                                           |
+| "The module descriptions I inferred from function names are accurate enough"            | Inferred descriptions are starting points. Phase 3 (AUDIT) exists to identify coverage gaps. Name-based inference misses purpose, constraints, and relationships. |
 ## Examples
 ### Example: Generating AGENTS.md from Graph

package/dist/agents/skills/cursor/harness-load-testing/SKILL.md CHANGED

@@ -259,6 +259,16 @@ Phase 4: ANALYZE
   Recommendation: Add DataLoader for orders resolver, re-test after fix
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                             | Reality                                                                                                                                                                                                                                                                  |
+| ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "The smoke test passed, so the full load test will probably be fine too."                                   | A smoke test at 1-2 VUs tells you the script runs — it says nothing about behavior at 100 or 1000 VUs. Connection pool exhaustion, lock contention, and GC pressure only appear under load. Smoke passing is the floor, not the ceiling.                                 |
+| "Staging is smaller than production, so results won't be accurate anyway — no point running the full test." | Staging results are always useful as a proxy: they reveal algorithmic bottlenecks, N+1 queries, and missing indexes that scale identically regardless of instance count. Document the scale factor and use it. Do not skip testing because the environment is imperfect. |
+| "We haven't changed the API, so the old load test baselines still apply."                                   | Baselines go stale when dependencies update, traffic patterns shift, or adjacent services change. A deployment that adds one middleware layer or changes a database index can move p99 by 200ms. Baselines must be re-validated, not assumed.                            |
+| "The p95 threshold is arbitrary — let's just relax it until the test passes."                               | A threshold without a documented basis is a guess. A threshold lowered to make a failing test pass is a suppressed regression. Thresholds must be derived from SLOs or measured baselines. If the SLO is wrong, change the SLO explicitly with stakeholder sign-off.     |
+| "We'll run the soak test later — we just need to ship the load test first."                                 | Soak tests catch failures that only emerge over hours: memory leaks, connection pool exhaustion, log file growth. If the feature involves a long-lived process, background worker, or WebSocket, skipping the soak test means the failure surfaces in production.        |
 ## Gates
 - **No load tests against production without explicit human approval.** Load tests can cause real outages. The target environment must be verified as non-production before execution. If production testing is required, a `[checkpoint:human-verify]` must be passed with documented approval.

package/dist/agents/skills/cursor/harness-ml-ops/SKILL.md CHANGED

@@ -326,6 +326,16 @@ Phase 4: VALIDATE
   After fixes: projected NEEDS_ATTENTION (missing precision/recall metrics)
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                         | Reality                                                                                                                                                                                                                                                                                                                                 |
+| ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We re-trained with more data but the architecture is the same — the previous evaluation still applies."                | Evaluation results are bound to a specific model artifact, not to the architecture. A re-trained model with different weights can have dramatically different failure modes even if accuracy appears similar. Every model version that goes to production must be evaluated against the golden set, not inherited from its predecessor. |
+| "The model file is only 8MB — committing it to git is more convenient than setting up an artifact store."               | Model files in git corrupt repository history, explode clone times for all contributors, and cannot be versioned alongside experiment metadata. Convenience now creates permanent technical debt. The artifact store setup is a one-time cost; git pollution is permanent.                                                              |
+| "Loading the model inside the request handler is simpler — the model is small enough that latency won't be noticeable." | Per-request model loading adds I/O and deserialization on every inference call, holds no persistent state across requests, and collapses under any meaningful concurrency. "Small enough" is a guess without measurement. Models must be loaded at startup and held in memory.                                                          |
+| "We can add experiment tracking after we get the model working — right now we just need to iterate quickly."            | Experiment tracking is hardest to add retroactively because you cannot reconstruct the conditions of runs you did not log. The runs being executed without tracking right now are the ones producing the model that may go to production. Log them now or accept that the model is not reproducible.                                    |
+| "The prompt template is short enough to read in context — version controlling it adds unnecessary process."             | Prompts embedded in application code change silently when developers edit them, have no history of what changed and why, and cannot be evaluated independently. A prompt is a model artifact. It requires the same versioning, evaluation, and promotion discipline as model weights.                                                   |
 ## Gates
 - **No deploying models without evaluation.** A model that has not been evaluated against a golden set or baseline cannot be promoted to production. This is always an error.

package/dist/agents/skills/cursor/harness-mobile-patterns/SKILL.md CHANGED

@@ -311,6 +311,16 @@ Phase 4: VALIDATE
   Store submission ready: PASS
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                                                  | Reality                                                                                                                                                                                                                                                                                                  |
+| -------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We request all permissions at launch to get them out of the way — users can deny them if they want."                            | App stores treat permissions-at-launch as a review red flag and users deny at much higher rates when there is no contextual explanation. Permissions requested at the moment they are needed, with a sentence explaining why, consistently achieve higher grant rates and reduce store rejection risk.   |
+| "Universal Links are optional — the URL scheme fallback works fine for deep linking."                                            | URL scheme fallbacks (`myapp://`) can be claimed by any installed app on the device. A malicious or coincidentally named app can intercept links intended for yours. Universal Links with verified `apple-app-site-association` files are cryptographically bound to your domain and cannot be hijacked. |
+| "The push notification handler works in foreground and background — we can handle the terminated state separately after launch." | Users often first interact with an app by tapping a push notification when the app is terminated. The cold-start tap handler is commonly the first impression. Shipping without it means a class of users experiences a broken entry point from day one.                                                 |
+| "The staging configuration is slightly different but we'll remember to change it before the App Store build."                    | "Remember to change it" is not a process. Staging URLs, debug API keys, and sandbox APNs environments in production builds have shipped before and will again. Separate build configurations and environment-specific entitlement files are the only reliable mitigation.                                |
+| "The privacy manifest requirement is new — we'll add it in the next release after the store flags it."                           | Apple has enforced PrivacyInfo.xcprivacy requirements for new submissions and updates since May 2024. Submitting without it results in rejection, which blocks the entire release. Adding it retroactively under rejection pressure is strictly more costly than adding it now.                          |
 ## Gates
 - **No missing permission usage descriptions.** Every permission requested in code must have a corresponding usage description in the platform manifest. Missing descriptions cause automatic App Store rejection on iOS and are a best practice requirement on Android.

package/dist/agents/skills/cursor/harness-mutation-test/SKILL.md CHANGED

@@ -236,6 +236,15 @@ mvn org.pitest:pitest-maven:mutationCoverage
 # Report generated at target/pit-reports/index.html
 ```
+## Rationalizations to Reject
+| Rationalization                                                                                | Why It Is Wrong                                                                                                      |
+| ---------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
+| "We have 80% line coverage, so test quality is already good"                                   | Line coverage measures execution, not verification. Mutation testing reveals missing assertions and weak assertions. |
+| "The survived mutants are in non-critical utility code, so we can ignore them"                 | Every survived mutant must be either addressed with a test or explicitly justified as an equivalent mutant.          |
+| "I will write a test that targets the specific mutation to kill it"                            | No gaming the mutation score. Every new test must test a meaningful behavior, not just kill a specific mutant.       |
+| "The test suite has some failures, but we can still run mutation testing to see what we learn" | No mutation testing against a failing test suite. Mutations against broken tests produce garbage results.            |
 ## Gates
 - **No mutation testing against a failing test suite.** All tests must pass before mutants are generated. Running mutations against broken tests produces garbage results. Fix the tests first.

package/dist/agents/skills/cursor/harness-observability/SKILL.md CHANGED

@@ -268,6 +268,16 @@ Phase 4: VALIDATE
   Result: WARN -- 3 instrumentation gaps, alerting needs SLO alignment
 ```
+## Rationalizations to Reject
+| Rationalization                                                                        | Reality                                                                                                                                                                                                                                                                                                     |
+| -------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| "We can see what's happening in CloudWatch logs — we don't need structured logging"    | Unstructured log lines cannot be queried, aggregated, or correlated across services. When an incident spans three services, searching for a request ID across unstructured logs is manual forensics. Structured logging is not a nicety — it is the foundation for incident response.                       |
+| "We'll add alerting once we've seen a few incidents and know what to alert on"         | The first incident is the worst time to define alerting. SLO-based burn rate alerts can be defined from traffic patterns before any incidents occur. Waiting for incidents to define thresholds means every early failure goes undetected.                                                                  |
+| "User ID is a useful label for the latency metric — it helps us debug per-user issues" | User ID as a metric label creates one time series per user, which at 100,000 users means 100,000 label combinations. High-cardinality labels exhaust metric storage, cause query timeouts, and make the entire metrics system unstable. Use logs for per-user debugging; use metrics for aggregate signals. |
+| "The tracing library is initialized, so we have distributed tracing"                   | Initializing the library creates root spans but does not propagate context across HTTP boundaries, instrument database calls, or connect traces to logs. Trace initialization without verified end-to-end propagation produces disconnected, useless traces.                                                |
+| "We have alerts — they're just not linked to runbooks yet"                             | An alert that fires at 3am without a runbook link requires the on-call engineer to start debugging from scratch. The absence of a runbook is not a documentation gap; it is a mean-time-to-recover multiplier.                                                                                              |
 ## Gates
 - **No sensitive data in logs.** If PII, credentials, or tokens are detected in log output, it is a blocking finding. The logging configuration must sanitize or redact sensitive fields before any other improvements are made.
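
As an illustration of the gate above, redaction can be applied at the point where a structured log line is built. The following is a minimal TypeScript sketch and is not part of the harness package: `SENSITIVE_KEYS`, `redact`, and `formatLogLine` are hypothetical names, and the key list is an assumption that a real project would replace with its own.

```typescript
// Hypothetical helper for illustration only; not an API of @harness-engineering/cli.
type LogFields = Record<string, unknown>;

// Assumed list of sensitive field names; extend to match the project's data.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization", "email"]);

// Returns a copy of `fields` with sensitive values replaced, recursing into
// nested plain objects. Arrays are passed through unchanged in this sketch.
function redact(fields: LogFields): LogFields {
  const out: LogFields = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      out[key] = "[REDACTED]";
    } else if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      out[key] = redact(value as LogFields);
    } else {
      out[key] = value;
    }
  }
  return out;
}

// Emits one JSON object per log line so fields such as a request ID can be
// queried and correlated across services. `level` and `message` are written
// last so caller-supplied fields cannot overwrite them.
function formatLogLine(level: string, message: string, fields: LogFields): string {
  return JSON.stringify({ ...redact(fields), level, message });
}
```

A call such as `formatLogLine("info", "login ok", { requestId: "r-1", password: "hunter2" })` keeps the request ID queryable while the password value never reaches the log output.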

package/dist/agents/skills/cursor/harness-onboarding/SKILL.md CHANGED

@@ -23,7 +23,7 @@
    - Constraints and forbidden patterns
    - Any special instructions or warnings
-2. **Read `harness.yaml`.** Extract:
+2. **Read `harness.config.json`.** Extract:
    - Project name and stack
    - Adoption level (basic, intermediate, advanced)
    - Layer definitions and their directory mappings
@@ -48,7 +48,7 @@
 2. **Map the architecture.** Walk the directory structure and identify:
    - Top-level organization pattern (monorepo, single package, workspace)
    - Source code location and entry points
-   - Layer boundaries (from `harness.yaml` and actual directory structure)
+   - Layer boundaries (from `harness.config.json` and actual directory structure)
    - Shared utilities or common modules
    - Configuration files and their purposes
@@ -61,7 +61,7 @@
    - Code formatting (detect from config files: `.prettierrc`, `.eslintrc`, `biome.json`)
 4. **Map the constraints.** Identify what is restricted:
-   - Forbidden imports (from `harness.yaml` dependency constraints)
+   - Forbidden imports (from `harness.config.json` dependency constraints)
    - Layer boundary rules (which layers can import from which)
    - Linting rules that encode architectural decisions
    - Any constraints documented in `AGENTS.md` that are not yet automated
@@ -95,8 +95,8 @@ Graph queries produce a complete architecture map in seconds, including transiti
 ### Phase 3: ORIENT — Identify Adoption Level and Maturity
-1. **Confirm the adoption level** matches what `harness.yaml` declares:
-   - Basic: `AGENTS.md` and `harness.yaml` exist but no layers or constraints
+1. **Confirm the adoption level** matches what `harness.config.json` declares:
+   - Basic: `AGENTS.md` and `harness.config.json` exist but no layers or constraints
    - Intermediate: Layers defined, dependency constraints enforced, at least one custom skill
    - Advanced: Personas, state management, learnings, CI integration
@@ -184,21 +184,29 @@ Graph queries produce a complete architecture map in seconds, including transiti
 - **`harness check-deps`** — Run to verify dependency constraints are passing, which confirms layer boundaries are respected.
 - **`harness state show`** — View current state to understand where the last session left off.
 - **`AGENTS.md`** — Primary source of project context and agent instructions.
-- **`harness.yaml`** — Source of structural configuration (layers, constraints, skills).
+- **`harness.config.json`** — Source of structural configuration (layers, constraints, skills).
 - **`.harness/learnings.md`** — Historical context and institutional knowledge.
 ## Success Criteria
-- All four configuration sources were read (`AGENTS.md`, `harness.yaml`, `.harness/learnings.md`, `.harness/state.json`)
+- All four configuration sources were read (`AGENTS.md`, `harness.config.json`, `.harness/learnings.md`, `.harness/state.json`)
 - Technology stack is accurately identified (language, framework, test runner, build tool)
 - Architecture is mapped with correct layer boundaries and dependency directions
 - Conventions are identified from actual code patterns, not assumed
-- Constraints are enumerated from both `harness.yaml` and `AGENTS.md`
+- Constraints are enumerated from both `harness.config.json` and `AGENTS.md`
 - Adoption level is confirmed (not just declared — validated)
 - A structured orientation summary is produced with all sections filled
 - The "Getting Started" section is actionable and tailored to the audience
 - `harness validate` was run and results are reported
+## Rationalizations to Reject
+| Rationalization                                                                                                            | Reality                                                                                                                                                                                          |
+| -------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| "I can skip reading .harness/learnings.md since it is just historical notes"                                               | Learnings contain hard-won insights from previous sessions -- decisions made, gotchas discovered, patterns that worked or failed. Skipping them means repeating mistakes already diagnosed.      |
+| "The harness.config.json says intermediate, so I can report that without validation"                                       | Declared adoption level must be confirmed, not assumed. A project that declares intermediate but fails harness validate is not truly intermediate.                                               |
+| "I will map the architecture by reading the directory names since that is faster than checking conventions in actual code" | Conventions must be identified from actual code patterns, not assumed from directory structure. File naming, import style, and error handling can only be verified by reading real source files. |
 ## Examples
 ### Example: Onboarding to an Intermediate TypeScript Project
@@ -211,7 +219,7 @@ Read AGENTS.md:
   - Stack: TypeScript, Express, Vitest, PostgreSQL
   - Conventions: zod validation, repository pattern, kebab-case files
-Read harness.yaml:
+Read harness.config.json:
   - Level: intermediate
   - Layers: presentation (src/routes/), business (src/services/), data (src/repositories/)
   - Constraints: presentation → business OK, business → data OK, data → presentation FORBIDDEN
@@ -258,7 +266,7 @@ Produce orientation with all sections. Getting Started for this context:
 ```
 Read AGENTS.md — exists, minimal content
-Read harness.yaml — level: basic, no layers defined
+Read harness.config.json — level: basic, no layers defined
 No .harness/learnings.md
 No .harness/state.json
 ```
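
The layer constraints in the intermediate TypeScript example above (presentation → business OK, business → data OK, data → presentation FORBIDDEN) can be read as a small allow-list of import directions. The sketch below is an assumption-laden illustration in TypeScript. It is not the schema of `harness.config.json` and not how `harness check-deps` is implemented. The `Layer` type, `layerOf`, and `isImportAllowed` are hypothetical, and treating every unlisted cross-layer direction as forbidden is an assumption.

```typescript
// Hypothetical model of the layers from the example; not the real config format.
type Layer = { name: string; dir: string };

const layers: Layer[] = [
  { name: "presentation", dir: "src/routes/" },
  { name: "business", dir: "src/services/" },
  { name: "data", dir: "src/repositories/" },
];

// Allowed cross-layer import directions, taken from the example.
// Assumption: any cross-layer direction not listed here is forbidden.
const allowed = new Set(["presentation->business", "business->data"]);

// Maps a file path to its layer name by directory prefix.
function layerOf(path: string): string | undefined {
  return layers.find((layer) => path.startsWith(layer.dir))?.name;
}

// Returns true when `fromFile` may import `toFile` under the rules above.
function isImportAllowed(fromFile: string, toFile: string): boolean {
  const from = layerOf(fromFile);
  const to = layerOf(toFile);
  // Files outside every declared layer are not checked in this sketch.
  if (from === undefined || to === undefined) return true;
  // Imports within the same layer are assumed to be fine.
  if (from === to) return true;
  return allowed.has(`${from}->${to}`);
}
```

Under these assumptions, an import from `src/routes/users.ts` to `src/services/user-service.ts` is accepted, while an import from `src/repositories/user-repo.ts` to `src/routes/users.ts` is rejected.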

package/dist/agents/skills/cursor/harness-parallel-agents/SKILL.md CHANGED

@@ -159,6 +159,15 @@ For each independent task, write a focused agent brief:
 - `harness validate` passes after integration
 - No agent modified files outside its declared scope
+## Rationalizations to Reject
+| Rationalization                                                                              | Why It Is Wrong                                                                                                                                |
+| -------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
+| "These two tasks touch different functions in the same file, so they are independent enough" | If both tasks write to the same file, they are NOT independent. Even edits to different functions in the same file create merge conflicts.     |
+| "I verified independence manually -- no need to run check_task_independence"                 | Manual verification misses transitive dependency overlap. check_task_independence with graph-expanded analysis catches transitive conflicts.   |
+| "There are only 2 independent tasks, but parallelism would save time"                        | Parallelize only when there are at least 3 independent tasks. For 2 tasks, coordination overhead outweighs the parallelism benefit.            |
+| "Each agent's tests pass, so integration is fine"                                            | Step 4 requires running the FULL test suite after integration. Parallel changes can cause integration failures that individual test runs miss. |
 ## Examples
 ### Example: Parallel Implementation of Three Independent Services