nightytidy 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61)
  1. package/LICENSE +21 -0
  2. package/README.md +314 -0
  3. package/bin/nightytidy.js +3 -0
  4. package/package.json +55 -0
  5. package/src/checks.js +367 -0
  6. package/src/claude.js +655 -0
  7. package/src/cli.js +1012 -0
  8. package/src/consolidation.js +81 -0
  9. package/src/dashboard-html.js +496 -0
  10. package/src/dashboard-standalone.js +167 -0
  11. package/src/dashboard-tui.js +208 -0
  12. package/src/dashboard.js +427 -0
  13. package/src/env.js +100 -0
  14. package/src/executor.js +550 -0
  15. package/src/git.js +348 -0
  16. package/src/lock.js +186 -0
  17. package/src/logger.js +111 -0
  18. package/src/notifications.js +33 -0
  19. package/src/orchestrator.js +919 -0
  20. package/src/prompts/loader.js +55 -0
  21. package/src/prompts/manifest.json +138 -0
  22. package/src/prompts/specials/changelog.md +28 -0
  23. package/src/prompts/specials/consolidation.md +61 -0
  24. package/src/prompts/specials/doc-update.md +1 -0
  25. package/src/prompts/specials/report.md +95 -0
  26. package/src/prompts/steps/01-documentation.md +173 -0
  27. package/src/prompts/steps/02-test-coverage.md +181 -0
  28. package/src/prompts/steps/03-test-hardening.md +181 -0
  29. package/src/prompts/steps/04-test-architecture.md +130 -0
  30. package/src/prompts/steps/05-test-consolidation.md +165 -0
  31. package/src/prompts/steps/06-test-quality.md +211 -0
  32. package/src/prompts/steps/07-api-design.md +165 -0
  33. package/src/prompts/steps/08-security-sweep.md +207 -0
  34. package/src/prompts/steps/09-dependency-health.md +217 -0
  35. package/src/prompts/steps/10-codebase-cleanup.md +189 -0
  36. package/src/prompts/steps/11-crosscutting-concerns.md +196 -0
  37. package/src/prompts/steps/12-file-decomposition.md +263 -0
  38. package/src/prompts/steps/13-code-elegance.md +329 -0
  39. package/src/prompts/steps/14-architectural-complexity.md +297 -0
  40. package/src/prompts/steps/15-type-safety.md +192 -0
  41. package/src/prompts/steps/16-logging-error-message.md +173 -0
  42. package/src/prompts/steps/17-data-integrity.md +139 -0
  43. package/src/prompts/steps/18-performance.md +183 -0
  44. package/src/prompts/steps/19-cost-resource-optimization.md +136 -0
  45. package/src/prompts/steps/20-error-recovery.md +145 -0
  46. package/src/prompts/steps/21-race-condition-audit.md +178 -0
  47. package/src/prompts/steps/22-bug-hunt.md +229 -0
  48. package/src/prompts/steps/23-frontend-quality.md +210 -0
  49. package/src/prompts/steps/24-uiux-audit.md +284 -0
  50. package/src/prompts/steps/25-state-management.md +170 -0
  51. package/src/prompts/steps/26-perceived-performance.md +190 -0
  52. package/src/prompts/steps/27-devops.md +165 -0
  53. package/src/prompts/steps/28-scheduled-job-chron-jobs.md +141 -0
  54. package/src/prompts/steps/29-observability.md +152 -0
  55. package/src/prompts/steps/30-backup-check.md +155 -0
  56. package/src/prompts/steps/31-product-polish-ux-friction.md +122 -0
  57. package/src/prompts/steps/32-feature-discovery-opportunity.md +128 -0
  58. package/src/prompts/steps/33-strategic-opportunities.md +217 -0
  59. package/src/report.js +540 -0
  60. package/src/setup.js +133 -0
  61. package/src/sync.js +536 -0
@@ -0,0 +1,181 @@
1
+ You are running an overnight test coverage expansion. Be thorough and methodical. Your job is to dramatically improve test coverage by writing high-quality tests that catch bugs, not just inflate coverage numbers.
2
+
3
+ ## Mission
4
+
5
+ Expand coverage across six phases in order: smoke tests → coverage gap analysis → unit tests → E2E tests → mutation testing → quality assessment. Work on branch `test-coverage-[date]`.
6
+
7
+ ### Phase 1: Smoke Tests
8
+ Before doing anything else, verify the app is alive and the critical path isn't broken. Smoke tests are the bouncer at the door — if the app can't get past them, nothing else matters.
9
+
10
+ **Write and run smoke tests that verify:**
11
+ 1. **The app loads** — hitting the main URL (or running the entry point) doesn't crash or return an error
12
+ 2. **Auth works** — the login page renders or a test user can authenticate
13
+ 3. **The main page/view renders** — the primary dashboard or home screen shows up with data
14
+ 4. **The API responds** — key backend endpoints return 200, not 500
15
+ 5. **The database connects** — a basic read operation succeeds
16
+
17
+ **Standards:**
18
+ - Target 3–7 tests total. These are intentionally shallow and fast (under 30 seconds for the full smoke suite).
19
+ - Smoke tests check "is it on fire?" — not "is every feature correct?" Don't test edge cases here.
20
+ - If ANY smoke test fails, stop and document the failure in the report as a **CRITICAL** finding before proceeding. Do not write deeper tests against a fundamentally broken app.
21
+ - Place smoke tests in a clearly labeled file/suite (e.g., `smoke.test.ts` or `__tests__/smoke/`) so they can be run independently after deploys.
22
+ - Match existing test conventions.
23
+
24
+ **After smoke tests pass**, proceed to deeper analysis.
25
+
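+ A minimal sketch of what this suite might look like, assuming Jest with supertest against an Express-style `app` export; every route path and credential below is a placeholder, not the project's real surface:
+
+ ```ts
+ // smoke.test.ts - intentionally shallow checks that the app is alive (illustrative only)
+ import request from 'supertest';
+ import { app } from '../src/app'; // hypothetical entry point
+
+ describe('smoke', () => {
+   it('app loads: health endpoint responds', async () => {
+     await request(app).get('/health').expect(200); // placeholder route
+   });
+
+   it('auth works: login page renders', async () => {
+     await request(app).get('/login').expect(200);
+   });
+
+   it('API responds: key endpoint does not 500', async () => {
+     const res = await request(app).get('/api/items'); // placeholder endpoint
+     expect(res.status).toBeLessThan(500);
+   });
+
+   it('database connects: a basic read succeeds', async () => {
+     await request(app).get('/api/health/db').expect(200); // placeholder DB-check route
+   });
+ });
+ ```
+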
26
+ ### Phase 2: Coverage Gap Analysis
27
+ Before writing tests, understand what's missing.
28
+ - Run the existing suite and generate a coverage report if tooling is available
29
+ - If not, manually identify: modules with zero tests, uncovered functions, unexercised code paths
30
+ - Categorize uncovered code by risk:
31
+   - **Critical**: Public APIs, auth, payment/billing, data mutation, user-facing flows
32
+ - **High**: Business logic, data transforms, validation, error handling
33
+ - **Medium**: Internal utilities, helpers, config
34
+ - **Low**: Logging, formatting, UI presentation
35
+ - Produce a prioritized list. Work top-down from Critical.
36
+
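+ If the project happens to use Jest, `npx jest --coverage` is usually enough to surface the gaps; the sketch below is illustrative config only, and the thresholds are examples rather than targets mandated by this audit:
+
+ ```ts
+ // jest.config.ts - illustrative coverage settings (paths and thresholds are placeholders)
+ import type { Config } from 'jest';
+
+ const config: Config = {
+   collectCoverage: true,
+   collectCoverageFrom: ['src/**/*.{ts,tsx}', '!src/**/*.d.ts'],
+   coverageReporters: ['text-summary', 'lcov'],
+   // Fail loudly if a critical directory drops below a floor (values are examples)
+   coverageThreshold: {
+     './src/billing/': { branches: 80, functions: 80, lines: 80, statements: 80 },
+   },
+ };
+
+ export default config;
+ ```
+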
37
+ ### Phase 3: Unit Test Generation
38
+ For each uncovered function/module, starting with Critical:
39
+
40
+ **Before writing tests:** Read the function and its callees. Understand inputs, outputs, side effects, and the implicit contract. Match existing test style/conventions.
41
+
42
+ **Cover these categories:**
43
+ 1. **Happy path** — normal usage with valid inputs
44
+ 2. **Edge cases** — null/undefined/empty, boundary values (0, -1, MAX_INT), single-element collections, unicode/special chars, long strings, concurrency if applicable
45
+ 3. **Error paths** — invalid types, missing fields, network/DB failures, permission denied
46
+ 4. **State transitions** — for stateful code, test transitions not just end states
47
+
48
+ **Quality standards:**
49
+ - Descriptive test names: `should return empty array when user has no orders` not `test1`
50
+ - One assertion per test where practical; tests must be independent
51
+ - Descriptive variable names; mock external dependencies (DB, APIs, filesystem)
52
+ - Match existing file structure and conventions
53
+
54
+ **After writing tests for each module:**
55
+ - Run them — they must pass
56
+ - If a test reveals an actual bug, DO NOT fix it. Mark as skipped with `// BUG: [description]` and document in the report
57
+
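+ A hedged sketch of how the four categories can land in one file; `getUserOrders` and its `db` dependency are invented names, and the skipped test shows the bug-marking convention from the rule above:
+
+ ```ts
+ // orders.test.ts - illustrative unit tests for a hypothetical getUserOrders(userId)
+ import { getUserOrders } from '../src/orders'; // hypothetical module under test
+ import { db } from '../src/db';                // hypothetical external dependency
+
+ jest.mock('../src/db');
+ const mockedDb = jest.mocked(db);
+
+ describe('getUserOrders', () => {
+   it('should return the orders for a user who has them (happy path)', async () => {
+     mockedDb.findOrders.mockResolvedValue([{ id: 'o1' }, { id: 'o2' }]);
+     await expect(getUserOrders('user-1')).resolves.toHaveLength(2);
+   });
+
+   it('should return an empty array when the user has no orders (edge case)', async () => {
+     mockedDb.findOrders.mockResolvedValue([]);
+     await expect(getUserOrders('user-2')).resolves.toEqual([]);
+   });
+
+   it('should surface a descriptive error when the database is unreachable (error path)', async () => {
+     mockedDb.findOrders.mockRejectedValue(new Error('connection refused'));
+     await expect(getUserOrders('user-3')).rejects.toThrow('connection refused');
+   });
+
+   // BUG: negative page numbers return the full history instead of being rejected (see report)
+   it.skip('should reject negative pagination values', async () => {
+     await expect(getUserOrders('user-1', { page: -1 })).rejects.toThrow();
+   });
+ });
+ ```
+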
58
+ ### Phase 4: End-to-End Tests
59
+
60
+ **If browser automation (Playwright MCP, etc.) is available:**
61
+ - Test critical user journeys: sign up/login/logout, core product workflow, payment/checkout, settings, any CRUD flow
62
+ - For each: happy path, validation errors, navigation, state persistence
63
+
64
+ **If not available:**
65
+ - Write API-level integration tests for critical endpoints
66
+ - Include auth in setup; test sequences representing real user workflows
67
+
68
+ **E2E standards:** Independent tests, self-managed test data with cleanup, deterministic data (not random), proper async waits (no `sleep()`), test user experience not implementation.
69
+
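+ Where Playwright is the available tool, the waiting discipline looks like the sketch below; the selectors, routes, and checkout flow are placeholders for whatever the project's critical journey actually is:
+
+ ```ts
+ // checkout.e2e.ts - illustrative Playwright journey with web-first assertions instead of sleep()
+ import { test, expect } from '@playwright/test';
+
+ test('user can complete checkout', async ({ page }) => {
+   await page.goto('/login');
+   await page.getByLabel('Email').fill('e2e-user@example.com');                  // deterministic test user
+   await page.getByLabel('Password').fill(process.env.E2E_PASSWORD ?? 'secret'); // placeholder secret handling
+   await page.getByRole('button', { name: 'Sign in' }).click();
+
+   // Playwright retries this assertion until the element appears; no arbitrary delay needed
+   await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
+
+   await page.getByRole('link', { name: 'Cart' }).click();
+   await page.getByRole('button', { name: 'Place order' }).click();
+   await expect(page.getByText('Order confirmed')).toBeVisible();
+ });
+ ```
+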
70
+ ### Phase 5: Mutation Testing on Critical Business Logic
71
+
72
+ Coverage tells you lines were executed, not that tests would catch bugs on those lines. Manual mutation testing answers: "If I introduced a bug, would any test catch it?"
73
+
74
+ **Step 1: Select targets (10-20 functions)**
75
+ Focus on functions where a silent bug causes: financial impact (pricing, billing, tax), data corruption (DB writes, import/export, migrations), security bypass (auth, permissions, input validation), or incorrect business decisions (analytics, threshold checks, eligibility, scoring).
76
+
77
+ Skip: presentation/UI logic, logging, test utilities, config/bootstrap, code already covered by strong contract/E2E tests.
78
+
79
+ **Step 2: Apply mutations one at a time**
80
+ For each target, apply mutations from these categories (prioritize comparison/boundary first, then arithmetic, logical, null/empty):
81
+
82
+ - **Arithmetic**: `+↔-`, `*↔/`, `%→*`, `+1→-1`, remove operation (`a+b→a`)
83
+ - **Comparison**: `>↔>=`, `<↔<=`, `==↔!=`, `>↔<`
84
+ - **Boundary**: constants ±1, array index bounds ±1, string slice ±1
85
+ - **Logical**: `&&↔||`, remove negation, remove conditional branch, `true↔false`, remove early return
86
+ - **Null/empty**: return `null`, `[]`, `{}`, `0`, or `""` instead of computed value
87
+
88
+ **For each mutation:**
89
+ 1. Make the single change
90
+ 2. Run relevant test file(s) only (not full suite)
91
+ 3. Record: **KILLED** (test failed ✓), **SURVIVED** (tests pass — gap found), **TIMED OUT** (inconclusive), or **COMPILE ERROR** (type safety win)
92
+ 4. REVERT immediately. Verify original tests pass before next mutation.
93
+
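+ A hedged illustration of one pass through that loop, using an invented pricing helper; the flipped comparison is the mutation, and the test at the end is what Step 3 would add if it survives:
+
+ ```ts
+ // Original (hypothetical pricing helper)
+ export function qualifiesForBulkDiscount(quantity: number): boolean {
+   return quantity >= 10;
+ }
+
+ // Mutation (comparison/boundary): `>=` becomes `>`
+ //   return quantity > 10;
+ // Run only the pricing tests. If they all still pass, the boundary at exactly 10 is untested: SURVIVED.
+
+ // Killing test: fails with the mutation applied, passes against the original
+ it('applies the bulk discount at exactly 10 items', () => {
+   expect(qualifiesForBulkDiscount(10)).toBe(true);
+   expect(qualifiesForBulkDiscount(9)).toBe(false);
+ });
+ ```
+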
94
+ **Step 3: Write tests for surviving mutants**
95
+ For every surviving mutation, write a test that fails with the mutation and passes without it. Verify the kill by re-applying the mutation. Revert and commit: `test: add mutation-killing test for [function] — [mutation type]`
96
+
97
+ **Step 4: Assess type system kills**
98
+ Note which mutation categories types catch automatically vs. which need tests. Document functions where stronger types would improve coverage (feeds into Type Safety prompt).
99
+
100
+ **Step 5: Calculate mutation scores**
101
+ Per function: mutation score = (killed by tests + killed by types) / total × 100%. Below 80% on critical logic is a red flag.
102
+
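+ As a worked example (invented numbers): a function that receives 12 mutations, of which tests kill 9 and the type checker rejects 1, scores (9 + 1) / 12 ≈ 83%, which clears the bar but still leaves two live mutants that need either a killing test or an explicit risk note.
+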
103
+ ### Phase 6: Test Quality Assessment
104
+ For all tests (existing and new):
105
+ - Are assertions meaningful? (Calling functions without asserting results is useless)
106
+ - Are tests testing the unit or just testing mocks?
107
+ - Cross-reference Phase 5: functions with low mutation scores despite high line coverage are top priority
108
+ - For critical logic not covered by Phase 5: try flipping a comparison or removing a conditional — would any test catch it?
109
+
110
+ ## Report
111
+
112
+ Create `audit-reports/` in project root if needed. Save as `audit-reports/02_TEST_COVERAGE_REPORT_[run-number]_[date]_[time in user's local time].md`, incrementing run number based on existing reports.
113
+
114
+ ### Sections:
115
+ 1. **Summary** — Starting/ending coverage %, test files created, test cases written, pass/fail/skip counts, mutation score on critical logic, smoke test results (pass/fail)
116
+ 2. **Smoke Test Results** — Pass/fail status for each smoke test. If any failed, document what was broken and whether it was resolved before deeper testing proceeded.
117
+ 3. **Coverage Gap Analysis** — Uncovered modules by priority; covered vs. remaining
118
+ 4. **Bugs Discovered** — File, line, description, severity, and the skipped test that reveals it
119
+ 5. **Mutation Testing Results**
120
+ - Per-function table: Function | File | Risk | Mutations | Killed (tests) | Killed (types) | Survived | Score
121
+ - Overall mutation score for critical logic
122
+ - Surviving mutants addressed (new tests): Function | Mutation | New Test | Confirms Kill?
123
+ - Surviving mutants NOT addressed: Function | Mutation | Why Survived | Risk
124
+ - Type system effectiveness analysis; functions needing stronger types
125
+ 6. **Tests Written** — Organized by module with brief descriptions
126
+ 7. **Remaining Gaps** — What needs coverage and why (time, complexity, infra); functions with low mutation scores
127
+ 8. **Testing Infrastructure Recommendations** — Missing utilities, suggested patterns, infra improvements, whether a mutation framework (Stryker, mutmut, etc.) is worth adopting
128
+
129
+ ## Rules
130
+ - Branch: `test-coverage-[date]`
131
+ - Every test must pass (or be explicitly skipped with a bug comment)
132
+ - Match existing conventions — don't introduce new frameworks
133
+ - Don't modify source code — test what exists
134
+ - Document genuinely untestable functions in the report
135
+ - Prioritize CRITICAL/HIGH. Don't spend time on LOW if CRITICAL gaps remain.
136
+ - Quality over quantity. 50 meaningful tests > 200 trivial ones.
137
+ - NEVER commit a mutation. Always revert and verify before proceeding.
138
+ - Focus mutation testing on 10-20 critical functions, not the entire codebase.
139
+ - If mutation testing a single function exceeds 20 minutes, move on and note it.
140
+ - When a mutation survives, write a killing test before moving to the next function.
141
+ - Skip mutations a linter/formatter would catch — focus on plausible real-world bugs.
142
+ - If any smoke test fails, document the failure as CRITICAL before proceeding to deeper phases.
143
+ - You have all night. Be thorough.
144
+
145
+ ## Chat Output Requirement
146
+
147
+ In addition to writing the full report file, you MUST print a summary directly in the conversation when you finish. Do not make the user open the report to get the highlights. The chat summary should include:
148
+
149
+ ### 1. Status Line
150
+ One sentence: what you did, how long it took, and whether all tests still pass.
151
+
152
+ ### 2. Key Findings
153
+ The most important things discovered — bugs, risks, wins, or surprises. Each bullet should be specific and actionable, not vague. Lead with severity or impact.
154
+
155
+ **Good:** "CRITICAL: No backup configuration found for the primary Postgres database — total data loss risk."
156
+ **Bad:** "Found some issues with backups."
157
+
158
+ ### 3. Changes Made (if applicable)
159
+ Bullet list of what was actually modified, added, or removed. Skip this section for read-only analysis runs.
160
+
161
+ ### 4. Recommendations
162
+
163
+ If there are legitimately beneficial recommendations worth pursuing right now, present them in a table. Do **not** force recommendations — if the audit surfaced no actionable improvements, simply state that no recommendations are warranted at this time and move on.
164
+
165
+ When recommendations exist, use this table format:
166
+
167
+ | # | Recommendation | Impact | Risk if Ignored | Worth Doing? | Details |
168
+ |---|---|---|---|---|---|
169
+ | *Sequential number* | *Short description (≤10 words)* | *What improves if addressed* | *Low / Medium / High / Critical* | *Yes / Probably / Only if time allows* | *1–3 sentences explaining the reasoning, context, or implementation guidance* |
170
+
171
+ Order rows by risk descending (Critical → High → Medium → Low). Be honest in the "Worth Doing?" column — not everything flagged is worth the engineering time. If a recommendation is marginal, say so.
172
+
173
+ ### 5. Report Location
174
+ State the full path to the detailed report file for deeper review.
175
+
176
+ ---
177
+
178
+ **Formatting rules for chat output:**
179
+ - Use markdown headers, bold for severity labels, and bullet points for scannability.
180
+ - Do not duplicate the full report contents — just the highlights and recommendations.
181
+ - If you made zero findings in a phase, say so in one line rather than omitting it silently.
@@ -0,0 +1,181 @@
1
+ # Test Hardening
2
+
3
+ ## Prompt
4
+
5
+ ```
6
+ You are running an overnight test hardening pass. You have several hours. Your job is to make the existing test suite more reliable and more complete in two specific areas: flaky test diagnosis/repair and API contract testing.
7
+
8
+ Work on a branch called `test-hardening-[date]`.
9
+
10
+ ## Your Mission
11
+
12
+ ### Phase 1: Flaky Test Diagnosis & Repair
13
+
14
+ Flaky tests are tests that sometimes pass and sometimes fail without code changes. They erode trust in the test suite and train developers to ignore failures. Your job is to find and fix them.
15
+
16
+ **Detection:**
17
+ - Run the full test suite 3-5 times in sequence
18
+ - Note any tests that produce different results across runs
19
+ - Look for tests that have been skipped/disabled with comments like "flaky", "intermittent", "timing issue", "TODO: fix"
20
+ - Check CI configuration and job history (if visible) for tests that are automatically retried or routinely re-run
21
+ - Look for common flaky patterns even in currently-passing tests:
22
+ - Tests that depend on wall clock time or `Date.now()`
23
+ - Tests that depend on execution order (shared mutable state between tests)
24
+ - Tests that use `setTimeout` or arbitrary delays instead of proper async waiting
25
+ - Tests that depend on database auto-increment IDs or insertion order
26
+ - Tests that depend on file system state, network availability, or external services without mocking
27
+ - Tests that use random/non-deterministic data without seeding
28
+ - Tests with race conditions in async setup/teardown
29
+ - Tests that assert on floating point equality without tolerance
30
+ - Tests that depend on object key ordering or array sort stability
31
+
32
+ **For each flaky or potentially flaky test found:**
33
+ 1. Diagnose the root cause — explain WHY it's flaky
34
+ 2. Fix it:
35
+ - Replace time-dependent assertions with deterministic alternatives (mock clocks, inject time)
36
+ - Isolate shared state — each test gets its own setup
37
+ - Replace arbitrary delays with proper async waiting (waitFor, polling, event-based)
38
+ - Mock external dependencies that introduce non-determinism
39
+ - Use deterministic test data with explicit seeds if randomness is needed
40
+ - Fix setup/teardown ordering issues
41
+ 3. Run the test 5 times to verify the fix holds
42
+ 4. Commit: `fix: resolve flaky test in [module] — [root cause]`
43
+
44
+ **For currently-disabled flaky tests:**
45
+ - Attempt to fix and re-enable them
46
+ - If you can fix them, commit with: `fix: re-enable previously flaky test [name]`
47
+ - If you can't fix them, document why in the report
48
+
49
+ ### Phase 2: API Contract Testing
50
+
51
+ Verify that the actual API behavior matches what consumers expect. This catches drift between documentation, types, and reality.
52
+
53
+ **Step 1: Map all API endpoints**
54
+ - Crawl the routing layer to find every endpoint
55
+ - For each endpoint, document:
56
+ - Method (GET/POST/PUT/DELETE/PATCH)
57
+ - Path (including URL parameters)
58
+ - Expected request body schema
59
+ - Expected response body schema for each status code
60
+ - Required headers / authentication
61
+ - Query parameters
62
+
63
+ **Step 2: Compare against documentation**
64
+ - If OpenAPI/Swagger docs exist, compare the actual code against the spec
65
+ - If TypeScript types/interfaces exist for request/response, compare against actual behavior
66
+ - Flag any discrepancies:
67
+ - Endpoints that exist in code but not in docs (undocumented)
68
+ - Endpoints in docs but not in code (stale docs)
69
+ - Response fields that exist in code but not in types
70
+ - Required fields in types that are actually optional in practice
71
+ - Status codes returned that aren't documented
72
+
73
+ **Step 3: Write contract tests**
74
+ For each endpoint, write tests that verify:
75
+ - Correct response status code for valid requests
76
+ - Correct response body structure (all expected fields present, correct types)
77
+ - Correct error response format for invalid requests (400, 401, 403, 404, 422)
78
+ - Required fields are actually required (omitting them returns appropriate error)
79
+ - Optional fields work when omitted
80
+ - Pagination behavior if applicable (correct page size, next/prev links, total count)
81
+ - Content-Type headers are correct
82
+ - CORS headers are present if expected
83
+
84
+ **Contract test quality standards:**
85
+ - Tests should validate STRUCTURE and TYPES, not specific values (unless values are constants)
86
+ - Use schema validation where possible (JSON Schema, Zod, Joi — match what the project uses)
87
+ - Test against a running instance of the app with a test database — not mocked responses
88
+ - Each endpoint gets its own test file or describe block
89
+ - Include authentication setup in test fixtures
90
+
91
+ **Step 4: Identify undocumented behavior**
92
+ As you write contract tests, you'll discover behavior that isn't documented anywhere:
93
+ - Default values for optional parameters
94
+ - Implicit filtering or sorting
95
+ - Hidden query parameters that work but aren't documented
96
+ - Rate limiting behavior
97
+ - Error message formats and codes
98
+
99
+ Document all of this. It's valuable even if you don't write tests for all of it.
100
+
101
+ ## Output Requirements
102
+
103
+ Create the `audit-reports/` directory in the project root if it doesn't already exist. Save the report as `audit-reports/03_TEST_HARDENING_REPORT_[run-number]_[date]_[time in user's local time].md` (e.g., `03_TEST_HARDENING_REPORT_01_2026-02-16_2129.md`). Increment the run number based on any existing reports with the same name prefix in that folder.
104
+
105
+ ### Report Structure
106
+
107
+ 1. **Summary**
108
+ - Flaky tests found and fixed: X
109
+ - Flaky tests found but couldn't fix: X
110
+ - Previously disabled tests re-enabled: X
111
+ - API endpoints found: X
112
+ - Contract tests written: X
113
+ - Documentation discrepancies found: X
114
+
115
+ 2. **Flaky Tests Fixed**
116
+ - Table: | Test Name | File | Root Cause | Fix Applied |
117
+
118
+ 3. **Flaky Tests Unresolved**
119
+ - Table: | Test Name | File | Root Cause | Why It Couldn't Be Fixed |
120
+
121
+ 4. **API Endpoint Map**
122
+ - Complete table of all endpoints with method, path, auth requirement, and test status
123
+
124
+ 5. **Documentation Discrepancies**
125
+ - Every mismatch between docs/types and actual behavior
126
+ - Include what the docs say vs. what the code does
127
+
128
+ 6. **Undocumented Behavior**
129
+ - Behavior you discovered that isn't documented anywhere
130
+
131
+ 7. **Recommendations**
132
+ - Patterns that are causing flakiness that the team should stop using
133
+ - Suggestions for preventing future documentation drift
134
+
135
+ ## Rules
136
+ - Branch: `test-hardening-[date]`
137
+ - When fixing flaky tests, DO NOT change the test's intent — only fix the non-determinism
138
+ - If a flaky test reveals that the underlying code has a race condition, document it as a bug — don't hide it by making the test more tolerant
139
+ - For contract tests, test against the actual running app, not mocks
140
+ - Don't generate contract tests for endpoints you can't actually call (missing auth setup, etc.) — document them as gaps instead
141
+ - Match existing test framework and conventions
142
+ - You have all night. Be thorough.
143
+ ```
144
+
145
+ ## Chat Output Requirement
146
+
147
+ In addition to writing the full report file, you MUST print a summary directly in the conversation when you finish. Do not make the user open the report to get the highlights. The chat summary should include:
148
+
149
+ ### 1. Status Line
150
+ One sentence: what you did, how long it took, and whether all tests still pass.
151
+
152
+ ### 2. Key Findings
153
+ The most important things discovered — bugs, risks, wins, or surprises. Each bullet should be specific and actionable, not vague. Lead with severity or impact.
154
+
155
+ **Good:** "CRITICAL: No backup configuration found for the primary Postgres database — total data loss risk."
156
+ **Bad:** "Found some issues with backups."
157
+
158
+ ### 3. Changes Made (if applicable)
159
+ Bullet list of what was actually modified, added, or removed. Skip this section for read-only analysis runs.
160
+
161
+ ### 4. Recommendations
162
+
163
+ If there are legitimately beneficial recommendations worth pursuing right now, present them in a table. Do **not** force recommendations — if the audit surfaced no actionable improvements, simply state that no recommendations are warranted at this time and move on.
164
+
165
+ When recommendations exist, use this table format:
166
+
167
+ | # | Recommendation | Impact | Risk if Ignored | Worth Doing? | Details |
168
+ |---|---|---|---|---|---|
169
+ | *Sequential number* | *Short description (≤10 words)* | *What improves if addressed* | *Low / Medium / High / Critical* | *Yes / Probably / Only if time allows* | *1–3 sentences explaining the reasoning, context, or implementation guidance* |
170
+
171
+ Order rows by risk descending (Critical → High → Medium → Low). Be honest in the "Worth Doing?" column — not everything flagged is worth the engineering time. If a recommendation is marginal, say so.
172
+
173
+ ### 5. Report Location
174
+ State the full path to the detailed report file for deeper review.
175
+
176
+ ---
177
+
178
+ **Formatting rules for chat output:**
179
+ - Use markdown headers, bold for severity labels, and bullet points for scannability.
180
+ - Do not duplicate the full report contents — just the highlights and recommendations.
181
+ - If you made zero findings in a phase, say so in one line rather than omitting it silently.
@@ -0,0 +1,130 @@
1
+ # Test Architecture & Antipattern Audit
2
+
3
+ You are running an overnight test architecture audit. Test Coverage checks quantity. Test Hardening fixes flakiness. Your job is different: determine whether the tests are actually *good* — whether they catch real regressions or just produce green checkmarks.
4
+
5
+ **READ-ONLY analysis.** Do not modify any code or create a branch.
6
+
7
+ ---
8
+
9
+ ## Global Rules
10
+
11
+ - Evaluate tests as a *regression safety net*, not as documentation or coverage metrics.
12
+ - For every antipattern found, include: file, test name, what's wrong, why it matters, and a concrete fix suggestion.
13
+ - Be honest. A 95% coverage suite full of antipatterns is worse than 60% coverage of well-written behavioral tests. Say so.
14
+ - You have all night. Read every test file.
15
+
16
+ ---
17
+
18
+ ## Phase 1: Test Inventory & Classification
19
+
20
+ **Catalog every test file.** For each, record: file path, test count, what it tests (unit/integration/E2E), framework, approximate runtime, and the source module it covers.
21
+
22
+ **Classify the suite:**
23
+ - Ratio of unit : integration : E2E tests. Is the testing pyramid inverted (too many E2E, too few unit)?
24
+ - Which modules have tests? Which have none? Which have tests that don't match current behavior?
25
+ - Are test file locations consistent (co-located vs. separate `__tests__` directory vs. mixed)?
26
+
27
+ ---
28
+
29
+ ## Phase 2: Antipattern Detection
30
+
31
+ Scan every test file for each category below. Be exhaustive — count every instance.
32
+
33
+ ### Implementation Coupling
34
+ - Tests asserting on internal method calls, private state, or execution order rather than inputs → outputs
35
+ - Tests that mock the module under test (testing the mock, not the code)
36
+ - Tests that break when you refactor internals without changing behavior
37
+ - Tests asserting exact function call counts on non-critical mocks (`expect(mock).toHaveBeenCalledTimes(3)` where 3 is an implementation detail)
38
+
39
+ ### Misleading Tests
40
+ - Test name says one thing, assertions check another ("should validate email" but only checks the function doesn't throw)
41
+ - Tests with zero assertions (run code but verify nothing — `expect` never called)
42
+ - Tests where every assertion is on a mock, not on actual output
43
+ - Tautological assertions (`expect(true).toBe(true)`, `expect(mock).toHaveBeenCalled()` on a mock called unconditionally in setup)
44
+ - `expect` inside callbacks or async blocks that never execute (test passes because the assertion is never reached)
45
+
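+ A hedged illustration of the gap (names invented): the first test passes no matter what the code returns, while the second actually constrains behavior:
+
+ ```ts
+ // Misleading: the name promises validation, the body only proves the function ran
+ it('should validate email addresses', async () => {
+   await validateEmail('not-an-email'); // no assertion: passes even if validation is broken
+ });
+
+ // Behavioral: given X, expect Y
+ it('should reject strings without an @ as invalid emails', async () => {
+   await expect(validateEmail('not-an-email')).resolves.toBe(false);
+ });
+ ```
+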
46
+ ### Fragile Snapshots
47
+ - Snapshot tests on large objects, full HTML trees, or API responses (any change = blind update)
48
+ - Snapshot files with frequent, large diffs in git history
49
+ - Inline snapshots that are clearly auto-updated without review (formatting artifacts, irrelevant fields)
50
+
51
+ ### Mock Overuse
52
+ - Mocks more complex than the code they replace (mock setup longer than the function body)
53
+ - Mocks that re-implement business logic (now you have two things to maintain)
54
+ - Mocks of things that should just be called (pure utility functions, simple data transforms)
55
+ - Deep mock chains (`mockService.mockMethod.mockReturnValue(...)` 5+ levels deep)
56
+ - Tests that only verify mock interactions with no behavioral assertion
57
+
58
+ ### Wrong Test Level
59
+ - "Unit" tests that spin up databases, HTTP servers, or read files (integration tests in disguise)
60
+ - "Integration" tests that mock every dependency (unit tests in disguise)
61
+ - E2E tests checking implementation details that a unit test should cover
62
+ - Unit tests duplicating exact E2E test coverage with no additional edge cases
63
+
64
+ ### Shared & Leaking State
65
+ - Global `beforeAll` setup shared across unrelated tests (test order dependence)
66
+ - Mutable module-level variables modified by tests without reset
67
+ - Database/file state not cleaned up between tests
68
+ - Tests that pass individually but fail when run together (or vice versa)
69
+
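+ A hedged sketch of the leak and its isolation fix; `configCache` stands in for any module-level mutable state shared between tests:
+
+ ```ts
+ // Symptom: this suite passes alone but fails after other suites have warmed the cache
+ import { configCache, getConfig } from '../src/config'; // hypothetical module with shared state
+
+ describe('getConfig', () => {
+   // Fix: reset the shared state so every test starts from the same baseline
+   beforeEach(() => {
+     configCache.clear();
+   });
+
+   it('falls back to the default when no override is cached', () => {
+     expect(getConfig('timeoutMs')).toBe(5000); // would observe another test's override without the reset
+   });
+ });
+ ```
+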
70
+ ### Duplication & Bloat
71
+ - Near-identical tests with one parameter changed (should be parameterized/table-driven)
72
+ - The same setup code copy-pasted across 10+ test files
73
+ - Test helper functions that duplicate production code instead of calling it
74
+ - Tests for trivially simple code (getters, one-line pass-throughs) consuming maintenance effort
75
+
76
+ ### Test Helper Bugs
77
+ - Shared test utilities, factories, or builders — read them as carefully as production code
78
+ - Helpers that silently swallow errors, supply wrong defaults, or produce invalid test data
79
+ - Factory functions that don't match current schema (added fields missing, removed fields still present)
80
+
81
+ ---
82
+
83
+ ## Phase 3: Regression Effectiveness Assessment
84
+
85
+ **For each major module**, answer: "If a developer introduced a subtle bug in this module tomorrow, would the tests catch it?"
86
+
87
+ - Assess whether tests check *behavior* (given X input, expect Y output) or just *execution* (the function ran without crashing)
88
+ - Cross-reference with mutation testing results if a Test Coverage run exists — functions with high line coverage but low mutation scores are the worst offenders
89
+ - Identify the most dangerous gaps: code that has tests but whose tests wouldn't catch common bug types (off-by-one, null handling, boundary conditions, wrong status codes)
90
+
91
+ **Rate each module:** Strong (would catch most regressions) / Weak (covers happy path only) / Decorative (tests exist but catch almost nothing) / None
92
+
93
+ ---
94
+
95
+ ## Phase 4: Structural Assessment
96
+
97
+ - **Test organization**: Consistent? Discoverable? Can you find the tests for a given module without searching?
98
+ - **Test naming conventions**: Descriptive (`should return 404 when user not found`) or vague (`test1`, `works correctly`)?
99
+ - **Setup/teardown patterns**: Consistent? Appropriate scope? Unnecessarily broad?
100
+ - **Custom matchers/utilities**: Well-maintained? Documented? Actually used?
101
+ - **Test configuration**: Reasonable timeouts? Appropriate parallelism? Sensible defaults?
102
+
103
+ ---
104
+
105
+ ## Output
106
+
107
+ Save as `audit-reports/04_TEST_ARCHITECTURE_REPORT_[run-number]_[date]_[time in user's local time].md`.
108
+
109
+ ### Report Structure
110
+
111
+ 1. **Executive Summary** — Suite health rating (decorative / fragile / adequate / strong / excellent), antipattern count by category, regression effectiveness score, one-line verdict: "This test suite [would / would not] catch a subtle billing bug introduced on a Friday afternoon."
112
+ 2. **Test Inventory** — Classification table, pyramid ratio, coverage distribution.
113
+ 3. **Antipattern Findings** — Per category: count, worst examples (file + test name + what's wrong), fix pattern. Table: | File | Test | Antipattern | Severity | Suggested Fix |
114
+ 4. **Regression Effectiveness** — Per-module rating table: | Module | Test Count | Coverage | Effectiveness Rating | Why |
115
+ 5. **Structural Assessment** — Organization, naming, conventions, configuration.
116
+ 6. **Recommendations** — Priority-ordered. Focus on: which antipatterns to fix first (highest regression risk), which modules need test rewrites vs. additions, conventions to adopt, and whether the team should invest in better test infrastructure (factories, custom matchers, test database management).
117
+
118
+ ## Chat Output Requirement
119
+
120
+ Print a summary in conversation:
121
+
122
+ 1. **Status Line** — What you analyzed.
123
+ 2. **Key Findings** — Specific bullets with severity. "42 tests have zero assertions — they pass even if the code returns garbage." Not "found some test quality issues."
124
+ 3. **Recommendations** table (if warranted):
125
+
126
+ | # | Recommendation | Impact | Risk if Ignored | Worth Doing? | Details |
127
+ |---|---|---|---|---|---|
128
+ | | ≤10 words | What improves | Low–Critical | Yes/Probably/If time | 1–3 sentences |
129
+
130
+ 4. **Report Location** — Full path.
@@ -0,0 +1,165 @@
1
+ # Test Consolidation
2
+
3
+ You are running an overnight test consolidation pass. Your job: find every test that is testing the same behavioral path as another test, eliminate the duplicates, and leave the suite smaller, clearer, and no less correct. You are not improving tests — you are removing noise so the real coverage is visible.
4
+
5
+ Work on branch `test-consolidation-[date]`.
6
+
7
+ ---
8
+
9
+ ## Global Rules
10
+
11
+ - Run the full test suite before touching anything. Establish a green baseline. If it's already red, stop and document that as a CRITICAL finding — do not consolidate a broken suite.
12
+ - Run the full test suite after every consolidation. If tests go red, revert the entire change immediately and document it.
13
+ - **Never consolidate tests that cover distinct behavioral sub-cases**, even if they look similar. Two tests with different inputs are only redundant if the behavior under test is identical for both inputs.
14
+ - A merged test must be at least as expressive as the originals. If consolidation makes the intent less clear, don't do it.
15
+ - When in doubt about whether two tests are truly redundant, they are not. Document them instead.
16
+ - Make small, atomic commits — one logical consolidation per commit. Commit format: `test: consolidate duplicate [description] tests in [file]`
17
+ - You have all night. Accuracy matters more than speed.
18
+
19
+ ---
20
+
21
+ ## Phase 1: Baseline & Inventory
22
+
23
+ **Step 1: Establish baseline**
24
+ Run the full test suite. Record: total test count, pass/fail/skip counts, coverage percentage if available. If any tests are failing, stop and document before proceeding.
25
+
26
+ **Step 2: Catalog every test file**
27
+ For each file, record: path, framework, test count, and the module it covers. Note files that appear to test the same module from different angles.
28
+
29
+ ---
30
+
31
+ ## Phase 2: Duplicate Detection
32
+
33
+ Work through these categories systematically. For each duplicate group found, document before touching anything.
34
+
35
+ ### Category 1: Verbatim and near-verbatim duplicates
36
+ - Tests with identical or near-identical bodies under different names
37
+ - Tests that differ only in local variable names with no effect on what's asserted
38
+ - Copy-paste tests where setup and assertions are structurally identical
39
+ - Tests in different files that exercise the exact same code path with the exact same inputs and assert the exact same outputs
40
+
41
+ ### Category 2: Redundant happy-path saturation
42
+ Find feature areas where multiple tests all exercise the happy path with no meaningful variation — valid inputs, expected outputs, no error conditions, no edge cases. Five tests confirming the same function returns the correct value for five slightly different valid inputs are not coverage breadth; they are noise. Flag every cluster where:
43
+ - All tests use valid inputs of the same category
44
+ - No test in the group covers a different outcome or code branch
45
+ - Removing all but one would leave identical behavioral coverage
46
+
47
+ ### Category 3: Parameterizable tests
48
+ Tests that are not verbatim duplicates but differ only in input/output values and would be more clearly expressed as a single parameterized test (`it.each` / `test.each` / `@pytest.mark.parametrize` / equivalent). These are not duplicates to delete — they are duplicates to merge into a table-driven test that is more readable and easier to extend.
49
+
50
+ Candidates: test blocks that share the same `describe`, have parallel structure, and differ only in their data.
51
+
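+ A hedged sketch of the merge in Jest; `calculateSalesTax` and the figures are invented, but the shape shows how each row stays self-documenting:
+
+ ```ts
+ // Before: several copy-pasted tests differing only in input and expected output
+ // After: one table-driven test covering the same behavioral cases
+ it.each([
+   { label: 'standard order', amount: 100,       expected: 8.25 },
+   { label: 'zero amount',    amount: 0,         expected: 0 },
+   { label: 'large order',    amount: 1_000_000, expected: 82_500 },
+ ])('calculates sales tax for a $label', ({ amount, expected }) => {
+   expect(calculateSalesTax(amount)).toBeCloseTo(expected);
+ });
+ ```
+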
52
+ ### Category 4: Redundant cross-layer testing
53
+ Find cases where the exact same specific assertion is verified at multiple test layers — unit, integration, and E2E all asserting the identical return value or behavior. Note: this is not always bad. Cross-layer tests catch different failure modes. Only flag as redundant when the tests are truly checking the same thing at the same fidelity and neither layer catches a failure mode the other would miss.
54
+
55
+ ---
56
+
57
+ ## Phase 3: Build the Consolidation Map
58
+
59
+ Before changing anything, produce a complete consolidation plan:
60
+
61
+ | Group | Files | Test Count | What They All Test | Proposed Action | Tests After | Risk Level |
62
+ |---|---|---|---|---|---|---|
63
+
64
+ **Proposed actions:**
65
+ - **Delete** — Remove all but the best-named instance (verbatim duplicates)
66
+ - **Parameterize** — Merge into a single `it.each` / data-driven test
67
+ - **Merge into describe** — Consolidate into one well-structured describe block
68
+ - **Leave** — Not redundant on closer inspection; document why
69
+
70
+ Review this plan in its entirety before executing. The plan is your contract — don't deviate during execution.
71
+
72
+ ---
73
+
74
+ ## Phase 4: Execute Consolidations
75
+
76
+ Work through the plan one group at a time.
77
+
78
+ **For each consolidation:**
79
+
80
+ 1. Re-read every test in the group to confirm your earlier analysis still holds
81
+ 2. Make the change (delete, parameterize, or restructure)
82
+ 3. Run the full test suite
83
+ 4. If green: commit with a clear message listing what was removed/merged and why
84
+ 5. If red: revert the entire change immediately, document what happened, move to the next group
85
+
86
+ **Parameterization standards:**
87
+ - Each row in the parameter table must be self-documenting — a reader should understand what case it covers without reading the test body
88
+ - Use descriptive case labels, not `case1` / `case2`
89
+ - The parameterized test must cover every case the originals covered — count assertions before and after
90
+
91
+ **Deletion standards:**
92
+ - Keep the test with the most descriptive name
93
+ - If the tests have different names that each capture something, write a new name that captures both before deleting
94
+ - Never delete a test that has a comment explaining non-obvious behavior — preserve that comment
95
+
96
+ ---
97
+
98
+ ## Phase 5: Post-Consolidation Validation
99
+
100
+ After all consolidations are complete:
101
+
102
+ 1. Run the full test suite one final time
103
+ 2. Record: new total test count, pass/fail/skip counts, coverage percentage
104
+ 3. Compare before/after: tests removed, tests parameterized, net change in coverage (should be zero or improved)
105
+ 4. Run the linter if the project has one — consolidated files may need formatting fixes
106
+
107
+ ---
108
+
109
+ ## Output
110
+
111
+ Create `audit-reports/` in project root if needed. Save as `audit-reports/05_TEST_CONSOLIDATION_REPORT_[run-number]_[date]_[time in user's local time].md`, incrementing run number based on existing reports.
112
+
113
+ ### Report Structure
114
+
115
+ 1. **Executive Summary** — Baseline test count, final test count, tests removed, tests parameterized, coverage before/after, all tests passing.
116
+
117
+ 2. **Consolidation Map** — The full plan table from Phase 3, updated with actual outcomes (executed / skipped / reverted).
118
+
119
+ 3. **Consolidations Executed** — For each: what was merged, how many tests removed, commit hash, tests passing after.
120
+
121
+ 4. **Consolidations Reverted** — What was attempted, what broke, why it couldn't be resolved safely.
122
+
123
+ 5. **Consolidations Identified but Not Executed** — Groups that were flagged but left alone: why (ambiguous intent, no safe merge, time constraints).
124
+
125
+ 6. **Remaining Redundancy** — Areas of the suite that still have high happy-path saturation, no adversarial coverage, or cross-layer redundancy — flagged here for the Test Quality & Adversarial Coverage run to address.
126
+
127
+ 7. **Recommendations** — Conventions to adopt to prevent re-accumulation (parameterized test patterns, code review checklist items, file organization suggestions).
128
+
129
+ ---
130
+
131
+ ## Chat Output Requirement
132
+
133
+ In addition to writing the full report file, you MUST print a summary directly in the conversation when you finish.
134
+
135
+ ### 1. Status Line
136
+ One sentence: what you did, how long it took, whether all tests still pass, and the before/after test count.
137
+
138
+ ### 2. Key Findings
139
+ The most important things discovered — specific and actionable.
140
+
141
+ **Good:** "Found 34 verbatim duplicate tests across `user.test.ts` and `user.service.test.ts` — all testing the same happy-path token validation with different variable names."
142
+ **Bad:** "Found some duplicate tests."
143
+
144
+ ### 3. Changes Made
145
+ Bullet list of consolidations executed. Skip if nothing was changed.
146
+
147
+ ### 4. Recommendations
148
+
149
+ If there are legitimately beneficial recommendations worth pursuing right now, present them in a table. Do **not** force recommendations — if the audit surfaced no actionable improvements, simply state that no recommendations are warranted at this time and move on.
150
+
151
+ | # | Recommendation | Impact | Risk if Ignored | Worth Doing? | Details |
152
+ |---|---|---|---|---|---|
153
+ | *Sequential number* | *Short description (≤10 words)* | *What improves if addressed* | *Low / Medium / High / Critical* | *Yes / Probably / Only if time allows* | *1–3 sentences explaining the reasoning, context, or implementation guidance* |
154
+
155
+ Order rows by risk descending. Be honest in "Worth Doing?" — not everything flagged is worth the engineering time.
156
+
157
+ ### 5. Report Location
158
+ State the full path to the detailed report file for deeper review.
159
+
160
+ ---
161
+
162
+ **Formatting rules for chat output:**
163
+ - Use markdown headers, bold for severity labels, and bullet points for scannability.
164
+ - Do not duplicate the full report contents — just the highlights and recommendations.
165
+ - If you made zero findings in a phase, say so in one line rather than omitting it silently.