nightytidy 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61)
  1. package/LICENSE +21 -0
  2. package/README.md +314 -0
  3. package/bin/nightytidy.js +3 -0
  4. package/package.json +55 -0
  5. package/src/checks.js +367 -0
  6. package/src/claude.js +655 -0
  7. package/src/cli.js +1012 -0
  8. package/src/consolidation.js +81 -0
  9. package/src/dashboard-html.js +496 -0
  10. package/src/dashboard-standalone.js +167 -0
  11. package/src/dashboard-tui.js +208 -0
  12. package/src/dashboard.js +427 -0
  13. package/src/env.js +100 -0
  14. package/src/executor.js +550 -0
  15. package/src/git.js +348 -0
  16. package/src/lock.js +186 -0
  17. package/src/logger.js +111 -0
  18. package/src/notifications.js +33 -0
  19. package/src/orchestrator.js +919 -0
  20. package/src/prompts/loader.js +55 -0
  21. package/src/prompts/manifest.json +138 -0
  22. package/src/prompts/specials/changelog.md +28 -0
  23. package/src/prompts/specials/consolidation.md +61 -0
  24. package/src/prompts/specials/doc-update.md +1 -0
  25. package/src/prompts/specials/report.md +95 -0
  26. package/src/prompts/steps/01-documentation.md +173 -0
  27. package/src/prompts/steps/02-test-coverage.md +181 -0
  28. package/src/prompts/steps/03-test-hardening.md +181 -0
  29. package/src/prompts/steps/04-test-architecture.md +130 -0
  30. package/src/prompts/steps/05-test-consolidation.md +165 -0
  31. package/src/prompts/steps/06-test-quality.md +211 -0
  32. package/src/prompts/steps/07-api-design.md +165 -0
  33. package/src/prompts/steps/08-security-sweep.md +207 -0
  34. package/src/prompts/steps/09-dependency-health.md +217 -0
  35. package/src/prompts/steps/10-codebase-cleanup.md +189 -0
  36. package/src/prompts/steps/11-crosscutting-concerns.md +196 -0
  37. package/src/prompts/steps/12-file-decomposition.md +263 -0
  38. package/src/prompts/steps/13-code-elegance.md +329 -0
  39. package/src/prompts/steps/14-architectural-complexity.md +297 -0
  40. package/src/prompts/steps/15-type-safety.md +192 -0
  41. package/src/prompts/steps/16-logging-error-message.md +173 -0
  42. package/src/prompts/steps/17-data-integrity.md +139 -0
  43. package/src/prompts/steps/18-performance.md +183 -0
  44. package/src/prompts/steps/19-cost-resource-optimization.md +136 -0
  45. package/src/prompts/steps/20-error-recovery.md +145 -0
  46. package/src/prompts/steps/21-race-condition-audit.md +178 -0
  47. package/src/prompts/steps/22-bug-hunt.md +229 -0
  48. package/src/prompts/steps/23-frontend-quality.md +210 -0
  49. package/src/prompts/steps/24-uiux-audit.md +284 -0
  50. package/src/prompts/steps/25-state-management.md +170 -0
  51. package/src/prompts/steps/26-perceived-performance.md +190 -0
  52. package/src/prompts/steps/27-devops.md +165 -0
  53. package/src/prompts/steps/28-scheduled-job-chron-jobs.md +141 -0
  54. package/src/prompts/steps/29-observability.md +152 -0
  55. package/src/prompts/steps/30-backup-check.md +155 -0
  56. package/src/prompts/steps/31-product-polish-ux-friction.md +122 -0
  57. package/src/prompts/steps/32-feature-discovery-opportunity.md +128 -0
  58. package/src/prompts/steps/33-strategic-opportunities.md +217 -0
  59. package/src/report.js +540 -0
  60. package/src/setup.js +133 -0
  61. package/src/sync.js +536 -0
@@ -0,0 +1,181 @@
1
+ You are running an overnight test coverage expansion. Be thorough and methodical. Your job is to dramatically improve test coverage by writing high-quality tests that catch bugs, not just inflate coverage numbers.
2
+
3
+ ## Mission
4
+
5
+ Expand coverage across six phases in order: smoke tests → coverage gap analysis → unit tests → E2E tests → mutation testing → quality assessment. Work on branch `test-coverage-[date]`.
6
+
7
+ ### Phase 1: Smoke Tests
8
+ Before doing anything else, verify the app is alive and the critical path isn't broken. Smoke tests are the bouncer at the door — if the app can't get past them, nothing else matters.
9
+
10
+ **Write and run smoke tests that verify:**
11
+ 1. **The app loads** — hitting the main URL (or running the entry point) doesn't crash or return an error
12
+ 2. **Auth works** — the login page renders or a test user can authenticate
13
+ 3. **The main page/view renders** — the primary dashboard or home screen shows up with data
14
+ 4. **The API responds** — key backend endpoints return 200, not 500
15
+ 5. **The database connects** — a basic read operation succeeds
16
+
17
+ **Standards:**
18
+ - Target 3–7 tests total. These are intentionally shallow and fast (under 30 seconds for the full smoke suite).
19
+ - Smoke tests check "is it on fire?" — not "is every feature correct?" Don't test edge cases here.
20
+ - If ANY smoke test fails, stop and document the failure in the report as a **CRITICAL** finding before proceeding. Do not write deeper tests against a fundamentally broken app.
21
+ - Place smoke tests in a clearly labeled file/suite (e.g., `smoke.test.ts` or `__tests__/smoke/`) so they can be run independently after deploys.
22
+ - Match existing test conventions.
23
+
24
+ **After smoke tests pass**, proceed to deeper analysis.
25
+
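+ A minimal sketch of what this suite might look like, assuming Jest with supertest against an Express-style `app` export; every route path and credential below is a placeholder, not the project's real surface:
+
+ ```ts
+ // smoke.test.ts - intentionally shallow checks that the app is alive (illustrative only)
+ import request from 'supertest';
+ import { app } from '../src/app'; // hypothetical entry point
+
+ describe('smoke', () => {
+   it('app loads: health endpoint responds', async () => {
+     await request(app).get('/health').expect(200); // placeholder route
+   });
+
+   it('auth works: login page renders', async () => {
+     await request(app).get('/login').expect(200);
+   });
+
+   it('API responds: key endpoint does not 500', async () => {
+     const res = await request(app).get('/api/items'); // placeholder endpoint
+     expect(res.status).toBeLessThan(500);
+   });
+
+   it('database connects: a basic read succeeds', async () => {
+     await request(app).get('/api/health/db').expect(200); // placeholder DB-check route
+   });
+ });
+ ```
+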
26
+ ### Phase 2: Coverage Gap Analysis
27
+ Before writing tests, understand what's missing.
28
+ - Run the existing suite and generate a coverage report if tooling is available
29
+ - If not, manually identify: modules with zero tests, uncovered functions, unexercised code paths
30
+ - Categorize uncovered code by risk:
31
+   - **Critical**: Public APIs, auth, payment/billing, data mutation, user-facing flows
32
+ - **High**: Business logic, data transforms, validation, error handling
33
+ - **Medium**: Internal utilities, helpers, config
34
+ - **Low**: Logging, formatting, UI presentation
35
+ - Produce a prioritized list. Work top-down from Critical.
36
+
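+ If the project happens to use Jest, `npx jest --coverage` is usually enough to surface the gaps; the sketch below is illustrative config only, and the thresholds are examples rather than targets mandated by this audit:
+
+ ```ts
+ // jest.config.ts - illustrative coverage settings (paths and thresholds are placeholders)
+ import type { Config } from 'jest';
+
+ const config: Config = {
+   collectCoverage: true,
+   collectCoverageFrom: ['src/**/*.{ts,tsx}', '!src/**/*.d.ts'],
+   coverageReporters: ['text-summary', 'lcov'],
+   // Fail loudly if a critical directory drops below a floor (values are examples)
+   coverageThreshold: {
+     './src/billing/': { branches: 80, functions: 80, lines: 80, statements: 80 },
+   },
+ };
+
+ export default config;
+ ```
+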
37
+ ### Phase 3: Unit Test Generation
38
+ For each uncovered function/module, starting with Critical:
39
+
40
+ **Before writing tests:** Read the function and its callees. Understand inputs, outputs, side effects, and the implicit contract. Match existing test style/conventions.
41
+
42
+ **Cover these categories:**
43
+ 1. **Happy path** — normal usage with valid inputs
44
+ 2. **Edge cases** — null/undefined/empty, boundary values (0, -1, MAX_INT), single-element collections, unicode/special chars, long strings, concurrency if applicable
45
+ 3. **Error paths** — invalid types, missing fields, network/DB failures, permission denied
46
+ 4. **State transitions** — for stateful code, test transitions not just end states
47
+
48
+ **Quality standards:**
49
+ - Descriptive test names: `should return empty array when user has no orders` not `test1`
50
+ - One assertion per test where practical; tests must be independent
51
+ - Descriptive variable names; mock external dependencies (DB, APIs, filesystem)
52
+ - Match existing file structure and conventions
53
+
54
+ **After writing tests for each module:**
55
+ - Run them — they must pass
56
+ - If a test reveals an actual bug, DO NOT fix it. Mark as skipped with `// BUG: [description]` and document in the report
57
+
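+ A hedged sketch of how the four categories can land in one file; `getUserOrders` and its `db` dependency are invented names, and the skipped test shows the bug-marking convention from the rule above:
+
+ ```ts
+ // orders.test.ts - illustrative unit tests for a hypothetical getUserOrders(userId)
+ import { getUserOrders } from '../src/orders'; // hypothetical module under test
+ import { db } from '../src/db';                // hypothetical external dependency
+
+ jest.mock('../src/db');
+ const mockedDb = jest.mocked(db);
+
+ describe('getUserOrders', () => {
+   it('should return the orders for a user who has them (happy path)', async () => {
+     mockedDb.findOrders.mockResolvedValue([{ id: 'o1' }, { id: 'o2' }]);
+     await expect(getUserOrders('user-1')).resolves.toHaveLength(2);
+   });
+
+   it('should return an empty array when the user has no orders (edge case)', async () => {
+     mockedDb.findOrders.mockResolvedValue([]);
+     await expect(getUserOrders('user-2')).resolves.toEqual([]);
+   });
+
+   it('should surface a descriptive error when the database is unreachable (error path)', async () => {
+     mockedDb.findOrders.mockRejectedValue(new Error('connection refused'));
+     await expect(getUserOrders('user-3')).rejects.toThrow('connection refused');
+   });
+
+   // BUG: negative page numbers return the full history instead of being rejected (see report)
+   it.skip('should reject negative pagination values', async () => {
+     await expect(getUserOrders('user-1', { page: -1 })).rejects.toThrow();
+   });
+ });
+ ```
+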
58
+ ### Phase 4: End-to-End Tests
59
+
60
+ **If browser automation (Playwright MCP, etc.) is available:**
61
+ - Test critical user journeys: sign up/login/logout, core product workflow, payment/checkout, settings, any CRUD flow
62
+ - For each: happy path, validation errors, navigation, state persistence
63
+
64
+ **If not available:**
65
+ - Write API-level integration tests for critical endpoints
66
+ - Include auth in setup; test sequences representing real user workflows
67
+
68
+ **E2E standards:** Independent tests, self-managed test data with cleanup, deterministic data (not random), proper async waits (no `sleep()`), test user experience not implementation.
69
+
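+ Where Playwright is the available tool, the waiting discipline looks like the sketch below; the selectors, routes, and checkout flow are placeholders for whatever the project's critical journey actually is:
+
+ ```ts
+ // checkout.e2e.ts - illustrative Playwright journey with web-first assertions instead of sleep()
+ import { test, expect } from '@playwright/test';
+
+ test('user can complete checkout', async ({ page }) => {
+   await page.goto('/login');
+   await page.getByLabel('Email').fill('e2e-user@example.com');                  // deterministic test user
+   await page.getByLabel('Password').fill(process.env.E2E_PASSWORD ?? 'secret'); // placeholder secret handling
+   await page.getByRole('button', { name: 'Sign in' }).click();
+
+   // Playwright retries this assertion until the element appears; no arbitrary delay needed
+   await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
+
+   await page.getByRole('link', { name: 'Cart' }).click();
+   await page.getByRole('button', { name: 'Place order' }).click();
+   await expect(page.getByText('Order confirmed')).toBeVisible();
+ });
+ ```
+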
70
+ ### Phase 5: Mutation Testing on Critical Business Logic
71
+
72
+ Coverage tells you lines were executed, not that tests would catch bugs on those lines. Manual mutation testing answers: "If I introduced a bug, would any test catch it?"
73
+
74
+ **Step 1: Select targets (10-20 functions)**
75
+ Focus on functions where a silent bug causes: financial impact (pricing, billing, tax), data corruption (DB writes, import/export, migrations), security bypass (auth, permissions, input validation), or incorrect business decisions (analytics, threshold checks, eligibility, scoring).
76
+
77
+ Skip: presentation/UI logic, logging, test utilities, config/bootstrap, code already covered by strong contract/E2E tests.
78
+
79
+ **Step 2: Apply mutations one at a time**
80
+ For each target, apply mutations from these categories (prioritize comparison/boundary first, then arithmetic, logical, null/empty):
81
+
82
+ - **Arithmetic**: `+↔-`, `*↔/`, `%→*`, `+1→-1`, remove operation (`a+b→a`)
83
+ - **Comparison**: `>↔>=`, `<↔<=`, `==↔!=`, `>↔<`
84
+ - **Boundary**: constants ±1, array index bounds ±1, string slice ±1
85
+ - **Logical**: `&&↔||`, remove negation, remove conditional branch, `true↔false`, remove early return
86
+ - **Null/empty**: return `null`, `[]`, `{}`, `0`, or `""` instead of computed value
87
+
88
+ **For each mutation:**
89
+ 1. Make the single change
90
+ 2. Run relevant test file(s) only (not full suite)
91
+ 3. Record: **KILLED** (test failed ✓), **SURVIVED** (tests pass — gap found), **TIMED OUT** (inconclusive), or **COMPILE ERROR** (type safety win)
92
+ 4. REVERT immediately. Verify original tests pass before next mutation.
93
+
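+ A hedged illustration of one pass through that loop, using an invented pricing helper; the flipped comparison is the mutation, and the test at the end is what Step 3 would add if it survives:
+
+ ```ts
+ // Original (hypothetical pricing helper)
+ export function qualifiesForBulkDiscount(quantity: number): boolean {
+   return quantity >= 10;
+ }
+
+ // Mutation (comparison/boundary): `>=` becomes `>`
+ //   return quantity > 10;
+ // Run only the pricing tests. If they all still pass, the boundary at exactly 10 is untested: SURVIVED.
+
+ // Killing test: fails with the mutation applied, passes against the original
+ it('applies the bulk discount at exactly 10 items', () => {
+   expect(qualifiesForBulkDiscount(10)).toBe(true);
+   expect(qualifiesForBulkDiscount(9)).toBe(false);
+ });
+ ```
+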
94
+ **Step 3: Write tests for surviving mutants**
95
+ For every surviving mutation, write a test that fails with the mutation and passes without it. Verify the kill by re-applying the mutation. Revert and commit: `test: add mutation-killing test for [function] — [mutation type]`
96
+
97
+ **Step 4: Assess type system kills**
98
+ Note which mutation categories types catch automatically vs. which need tests. Document functions where stronger types would improve coverage (feeds into Type Safety prompt).
99
+
100
+ **Step 5: Calculate mutation scores**
101
+ Per function: mutation score = (killed by tests + killed by types) / total × 100%. Below 80% on critical logic is a red flag.
102
+
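+ As a worked example (invented numbers): a function that receives 12 mutations, of which tests kill 9 and the type checker rejects 1, scores (9 + 1) / 12 ≈ 83%, which clears the bar but still leaves two live mutants that need either a killing test or an explicit risk note.
+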
103
+ ### Phase 6: Test Quality Assessment
104
+ For all tests (existing and new):
105
+ - Are assertions meaningful? (Calling functions without asserting results is useless)
106
+ - Are tests testing the unit or just testing mocks?
107
+ - Cross-reference Phase 5: functions with low mutation scores despite high line coverage are top priority
108
+ - For critical logic not covered by Phase 5: try flipping a comparison or removing a conditional — would any test catch it?
109
+
110
+ ## Report
111
+
112
+ Create `audit-reports/` in project root if needed. Save as `audit-reports/02_TEST_COVERAGE_REPORT_[run-number]_[date]_[time in user's local time].md`, incrementing run number based on existing reports.
113
+
114
+ ### Sections:
115
+ 1. **Summary** — Starting/ending coverage %, test files created, test cases written, pass/fail/skip counts, mutation score on critical logic, smoke test results (pass/fail)
116
+ 2. **Smoke Test Results** — Pass/fail status for each smoke test. If any failed, document what was broken and whether it was resolved before deeper testing proceeded.
117
+ 3. **Coverage Gap Analysis** — Uncovered modules by priority; covered vs. remaining
118
+ 4. **Bugs Discovered** — File, line, description, severity, and the skipped test that reveals it
119
+ 5. **Mutation Testing Results**
120
+ - Per-function table: Function | File | Risk | Mutations | Killed (tests) | Killed (types) | Survived | Score
121
+ - Overall mutation score for critical logic
122
+ - Surviving mutants addressed (new tests): Function | Mutation | New Test | Confirms Kill?
123
+ - Surviving mutants NOT addressed: Function | Mutation | Why Survived | Risk
124
+ - Type system effectiveness analysis; functions needing stronger types
125
+ 6. **Tests Written** — Organized by module with brief descriptions
126
+ 7. **Remaining Gaps** — What needs coverage and why (time, complexity, infra); functions with low mutation scores
127
+ 8. **Testing Infrastructure Recommendations** — Missing utilities, suggested patterns, infra improvements, whether a mutation framework (Stryker, mutmut, etc.) is worth adopting
128
+
129
+ ## Rules
130
+ - Branch: `test-coverage-[date]`
131
+ - Every test must pass (or be explicitly skipped with a bug comment)
132
+ - Match existing conventions — don't introduce new frameworks
133
+ - Don't modify source code — test what exists
134
+ - Document genuinely untestable functions in the report
135
+ - Prioritize CRITICAL/HIGH. Don't spend time on LOW if CRITICAL gaps remain.
136
+ - Quality over quantity. 50 meaningful tests > 200 trivial ones.
137
+ - NEVER commit a mutation. Always revert and verify before proceeding.
138
+ - Focus mutation testing on 10-20 critical functions, not the entire codebase.
139
+ - If mutation testing a single function exceeds 20 minutes, move on and note it.
140
+ - When a mutation survives, write a killing test before moving to the next function.
141
+ - Skip mutations a linter/formatter would catch — focus on plausible real-world bugs.
142
+ - If any smoke test fails, document the failure as CRITICAL before proceeding to deeper phases.
143
+ - You have all night. Be thorough.
144
+
145
+ ## Chat Output Requirement
146
+
147
+ In addition to writing the full report file, you MUST print a summary directly in the conversation when you finish. Do not make the user open the report to get the highlights. The chat summary should include:
148
+
149
+ ### 1. Status Line
150
+ One sentence: what you did, how long it took, and whether all tests still pass.
151
+
152
+ ### 2. Key Findings
153
+ The most important things discovered — bugs, risks, wins, or surprises. Each bullet should be specific and actionable, not vague. Lead with severity or impact.
154
+
155
+ **Good:** "CRITICAL: No backup configuration found for the primary Postgres database — total data loss risk."
156
+ **Bad:** "Found some issues with backups."
157
+
158
+ ### 3. Changes Made (if applicable)
159
+ Bullet list of what was actually modified, added, or removed. Skip this section for read-only analysis runs.
160
+
161
+ ### 4. Recommendations
162
+
163
+ If there are legitimately beneficial recommendations worth pursuing right now, present them in a table. Do **not** force recommendations — if the audit surfaced no actionable improvements, simply state that no recommendations are warranted at this time and move on.
164
+
165
+ When recommendations exist, use this table format:
166
+
167
+ | # | Recommendation | Impact | Risk if Ignored | Worth Doing? | Details |
168
+ |---|---|---|---|---|---|
169
+ | *Sequential number* | *Short description (≤10 words)* | *What improves if addressed* | *Low / Medium / High / Critical* | *Yes / Probably / Only if time allows* | *1–3 sentences explaining the reasoning, context, or implementation guidance* |
170
+
171
+ Order rows by risk descending (Critical → High → Medium → Low). Be honest in the "Worth Doing?" column — not everything flagged is worth the engineering time. If a recommendation is marginal, say so.
172
+
173
+ ### 5. Report Location
174
+ State the full path to the detailed report file for deeper review.
175
+
176
+ ---
177
+
178
+ **Formatting rules for chat output:**
179
+ - Use markdown headers, bold for severity labels, and bullet points for scannability.
180
+ - Do not duplicate the full report contents — just the highlights and recommendations.
181
+ - If you made zero findings in a phase, say so in one line rather than omitting it silently.
@@ -0,0 +1,181 @@
1
+ # Test Hardening
2
+
3
+ ## Prompt
4
+
5
+ ```
6
+ You are running an overnight test hardening pass. You have several hours. Your job is to make the existing test suite more reliable and more complete in two specific areas: flaky test diagnosis/repair and API contract testing.
7
+
8
+ Work on a branch called `test-hardening-[date]`.
9
+
10
+ ## Your Mission
11
+
12
+ ### Phase 1: Flaky Test Diagnosis & Repair
13
+
14
+ Flaky tests are tests that sometimes pass and sometimes fail without code changes. They erode trust in the test suite and train developers to ignore failures. Your job is to find and fix them.
15
+
16
+ **Detection:**
17
+ - Run the full test suite 3-5 times in sequence
18
+ - Note any tests that produce different results across runs
19
+ - Look for tests that have been skipped/disabled with comments like "flaky", "intermittent", "timing issue", "TODO: fix"
20
+ - Check CI configuration and job history (if visible) for tests that are automatically retried or routinely re-run
21
+ - Look for common flaky patterns even in currently-passing tests:
22
+ - Tests that depend on wall clock time or `Date.now()`
23
+ - Tests that depend on execution order (shared mutable state between tests)
24
+ - Tests that use `setTimeout` or arbitrary delays instead of proper async waiting
25
+ - Tests that depend on database auto-increment IDs or insertion order
26
+ - Tests that depend on file system state, network availability, or external services without mocking
27
+ - Tests that use random/non-deterministic data without seeding
28
+ - Tests with race conditions in async setup/teardown
29
+ - Tests that assert on floating point equality without tolerance
30
+ - Tests that depend on object key ordering or array sort stability
31
+
32
+ **For each flaky or potentially flaky test found:**
33
+ 1. Diagnose the root cause — explain WHY it's flaky
34
+ 2. Fix it:
35
+ - Replace time-dependent assertions with deterministic alternatives (mock clocks, inject time)
36
+ - Isolate shared state — each test gets its own setup
37
+ - Replace arbitrary delays with proper async waiting (waitFor, polling, event-based)
38
+ - Mock external dependencies that introduce non-determinism
39
+ - Use deterministic test data with explicit seeds if randomness is needed
40
+ - Fix setup/teardown ordering issues
41
+ 3. Run the test 5 times to verify the fix holds
42
+ 4. Commit: `fix: resolve flaky test in [module] — [root cause]`
43
+
44
+ **For currently-disabled flaky tests:**
45
+ - Attempt to fix and re-enable them
46
+ - If you can fix them, commit with: `fix: re-enable previously flaky test [name]`
47
+ - If you can't fix them, document why in the report
48
+
49
+ ### Phase 2: API Contract Testing
50
+
51
+ Verify that the actual API behavior matches what consumers expect. This catches drift between documentation, types, and reality.
52
+
53
+ **Step 1: Map all API endpoints**
54
+ - Crawl the routing layer to find every endpoint
55
+ - For each endpoint, document:
56
+ - Method (GET/POST/PUT/DELETE/PATCH)
57
+ - Path (including URL parameters)
58
+ - Expected request body schema
59
+ - Expected response body schema for each status code
60
+ - Required headers / authentication
61
+ - Query parameters
62
+
63
+ **Step 2: Compare against documentation**
64
+ - If OpenAPI/Swagger docs exist, compare the actual code against the spec
65
+ - If TypeScript types/interfaces exist for request/response, compare against actual behavior
66
+ - Flag any discrepancies:
67
+ - Endpoints that exist in code but not in docs (undocumented)
68
+ - Endpoints in docs but not in code (stale docs)
69
+ - Response fields that exist in code but not in types
70
+ - Required fields in types that are actually optional in practice
71
+ - Status codes returned that aren't documented
72
+
73
+ **Step 3: Write contract tests**
74
+ For each endpoint, write tests that verify:
75
+ - Correct response status code for valid requests
76
+ - Correct response body structure (all expected fields present, correct types)
77
+ - Correct error response format for invalid requests (400, 401, 403, 404, 422)
78
+ - Required fields are actually required (omitting them returns appropriate error)
79
+ - Optional fields work when omitted
80
+ - Pagination behavior if applicable (correct page size, next/prev links, total count)
81
+ - Content-Type headers are correct
82
+ - CORS headers are present if expected
83
+
84
+ **Contract test quality standards:**
85
+ - Tests should validate STRUCTURE and TYPES, not specific values (unless values are constants)
86
+ - Use schema validation where possible (JSON Schema, Zod, Joi — match what the project uses)
87
+ - Test against a running instance of the app with a test database — not mocked responses
88
+ - Each endpoint gets its own test file or describe block
89
+ - Include authentication setup in test fixtures
90
+
91
+ **Step 4: Identify undocumented behavior**
92
+ As you write contract tests, you'll discover behavior that isn't documented anywhere:
93
+ - Default values for optional parameters
94
+ - Implicit filtering or sorting
95
+ - Hidden query parameters that work but aren't documented
96
+ - Rate limiting behavior
97
+ - Error message formats and codes
98
+
99
+ Document all of this. It's valuable even if you don't write tests for all of it.
100
+
101
+ ## Output Requirements
102
+
103
+ Create the `audit-reports/` directory in the project root if it doesn't already exist. Save the report as `audit-reports/03_TEST_HARDENING_REPORT_[run-number]_[date]_[time in user's local time].md` (e.g., `03_TEST_HARDENING_REPORT_01_2026-02-16_2129.md`). Increment the run number based on any existing reports with the same name prefix in that folder.
104
+
105
+ ### Report Structure
106
+
107
+ 1. **Summary**
108
+ - Flaky tests found and fixed: X
109
+ - Flaky tests found but couldn't fix: X
110
+ - Previously disabled tests re-enabled: X
111
+ - API endpoints found: X
112
+ - Contract tests written: X
113
+ - Documentation discrepancies found: X
114
+
115
+ 2. **Flaky Tests Fixed**
116
+ - Table: | Test Name | File | Root Cause | Fix Applied |
117
+
118
+ 3. **Flaky Tests Unresolved**
119
+ - Table: | Test Name | File | Root Cause | Why It Couldn't Be Fixed |
120
+
121
+ 4. **API Endpoint Map**
122
+ - Complete table of all endpoints with method, path, auth requirement, and test status
123
+
124
+ 5. **Documentation Discrepancies**
125
+ - Every mismatch between docs/types and actual behavior
126
+ - Include what the docs say vs. what the code does
127
+
128
+ 6. **Undocumented Behavior**
129
+ - Behavior you discovered that isn't documented anywhere
130
+
131
+ 7. **Recommendations**
132
+ - Patterns that are causing flakiness that the team should stop using
133
+ - Suggestions for preventing future documentation drift
134
+
135
+ ## Rules
136
+ - Branch: `test-hardening-[date]`
137
+ - When fixing flaky tests, DO NOT change the test's intent — only fix the non-determinism
138
+ - If a flaky test reveals that the underlying code has a race condition, document it as a bug — don't hide it by making the test more tolerant
139
+ - For contract tests, test against the actual running app, not mocks
140
+ - Don't generate contract tests for endpoints you can't actually call (missing auth setup, etc.) — document them as gaps instead
141
+ - Match existing test framework and conventions
142
+ - You have all night. Be thorough.
143
+ ```
144
+
145
+ ## Chat Output Requirement
146
+
147
+ In addition to writing the full report file, you MUST print a summary directly in the conversation when you finish. Do not make the user open the report to get the highlights. The chat summary should include:
148
+
149
+ ### 1. Status Line
150
+ One sentence: what you did, how long it took, and whether all tests still pass.
151
+
152
+ ### 2. Key Findings
153
+ The most important things discovered — bugs, risks, wins, or surprises. Each bullet should be specific and actionable, not vague. Lead with severity or impact.
154
+
155
+ **Good:** "CRITICAL: No backup configuration found for the primary Postgres database — total data loss risk."
156
+ **Bad:** "Found some issues with backups."
157
+
158
+ ### 3. Changes Made (if applicable)
159
+ Bullet list of what was actually modified, added, or removed. Skip this section for read-only analysis runs.
160
+
161
+ ### 4. Recommendations
162
+
163
+ If there are legitimately beneficial recommendations worth pursuing right now, present them in a table. Do **not** force recommendations — if the audit surfaced no actionable improvements, simply state that no recommendations are warranted at this time and move on.
164
+
165
+ When recommendations exist, use this table format:
166
+
167
+ | # | Recommendation | Impact | Risk if Ignored | Worth Doing? | Details |
168
+ |---|---|---|---|---|---|
169
+ | *Sequential number* | *Short description (≤10 words)* | *What improves if addressed* | *Low / Medium / High / Critical* | *Yes / Probably / Only if time allows* | *1–3 sentences explaining the reasoning, context, or implementation guidance* |
170
+
171
+ Order rows by risk descending (Critical → High → Medium → Low). Be honest in the "Worth Doing?" column — not everything flagged is worth the engineering time. If a recommendation is marginal, say so.
172
+
173
+ ### 5. Report Location
174
+ State the full path to the detailed report file for deeper review.
175
+
176
+ ---
177
+
178
+ **Formatting rules for chat output:**
179
+ - Use markdown headers, bold for severity labels, and bullet points for scannability.
180
+ - Do not duplicate the full report contents — just the highlights and recommendations.
181
+ - If you made zero findings in a phase, say so in one line rather than omitting it silently.
@@ -0,0 +1,130 @@
1
+ # Test Architecture & Antipattern Audit
2
+
3
+ You are running an overnight test architecture audit. Test Coverage checks quantity. Test Hardening fixes flakiness. Your job is different: determine whether the tests are actually *good* — whether they catch real regressions or just produce green checkmarks.
4
+
5
+ **READ-ONLY analysis.** Do not modify any code or create a branch.
6
+
7
+ ---
8
+
9
+ ## Global Rules
10
+
11
+ - Evaluate tests as a *regression safety net*, not as documentation or coverage metrics.
12
+ - For every antipattern found, include: file, test name, what's wrong, why it matters, and a concrete fix suggestion.
13
+ - Be honest. A 95% coverage suite full of antipatterns is worse than 60% coverage of well-written behavioral tests. Say so.
14
+ - You have all night. Read every test file.
15
+
16
+ ---
17
+
18
+ ## Phase 1: Test Inventory & Classification
19
+
20
+ **Catalog every test file.** For each, record: file path, test count, what it tests (unit/integration/E2E), framework, approximate runtime, and the source module it covers.
21
+
22
+ **Classify the suite:**
23
+ - Ratio of unit : integration : E2E tests. Is the testing pyramid inverted (too many E2E, too few unit)?
24
+ - Which modules have tests? Which have none? Which have tests that don't match current behavior?
25
+ - Are test file locations consistent (co-located vs. separate `__tests__` directory vs. mixed)?
26
+
27
+ ---
28
+
29
+ ## Phase 2: Antipattern Detection
30
+
31
+ Scan every test file for each category below. Be exhaustive — count every instance.
32
+
33
+ ### Implementation Coupling
34
+ - Tests asserting on internal method calls, private state, or execution order rather than inputs → outputs
35
+ - Tests that mock the module under test (testing the mock, not the code)
36
+ - Tests that break when you refactor internals without changing behavior
37
+ - Tests asserting exact function call counts on non-critical mocks (`expect(mock).toHaveBeenCalledTimes(3)` where 3 is an implementation detail)
38
+
39
+ ### Misleading Tests
40
+ - Test name says one thing, assertions check another ("should validate email" but only checks the function doesn't throw)
41
+ - Tests with zero assertions (run code but verify nothing — `expect` never called)
42
+ - Tests where every assertion is on a mock, not on actual output
43
+ - Tautological assertions (`expect(true).toBe(true)`, `expect(mock).toHaveBeenCalled()` on a mock called unconditionally in setup)
44
+ - `expect` inside callbacks or async blocks that never execute (test passes because the assertion is never reached)
45
+
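+ A hedged illustration of the gap (names invented): the first test passes no matter what the code returns, while the second actually constrains behavior:
+
+ ```ts
+ // Misleading: the name promises validation, the body only proves the function ran
+ it('should validate email addresses', async () => {
+   await validateEmail('not-an-email'); // no assertion: passes even if validation is broken
+ });
+
+ // Behavioral: given X, expect Y
+ it('should reject strings without an @ as invalid emails', async () => {
+   await expect(validateEmail('not-an-email')).resolves.toBe(false);
+ });
+ ```
+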
46
+ ### Fragile Snapshots
47
+ - Snapshot tests on large objects, full HTML trees, or API responses (any change = blind update)
48
+ - Snapshot files with frequent, large diffs in git history
49
+ - Inline snapshots that are clearly auto-updated without review (formatting artifacts, irrelevant fields)
50
+
51
+ ### Mock Overuse
52
+ - Mocks more complex than the code they replace (mock setup longer than the function body)
53
+ - Mocks that re-implement business logic (now you have two things to maintain)
54
+ - Mocks of things that should just be called (pure utility functions, simple data transforms)
55
+ - Deep mock chains (`mockService.mockMethod.mockReturnValue(...)` 5+ levels deep)
56
+ - Tests that only verify mock interactions with no behavioral assertion
57
+
58
+ ### Wrong Test Level
59
+ - "Unit" tests that spin up databases, HTTP servers, or read files (integration tests in disguise)
60
+ - "Integration" tests that mock every dependency (unit tests in disguise)
61
+ - E2E tests checking implementation details that a unit test should cover
62
+ - Unit tests duplicating exact E2E test coverage with no additional edge cases
63
+
64
+ ### Shared & Leaking State
65
+ - Global `beforeAll` setup shared across unrelated tests (test order dependence)
66
+ - Mutable module-level variables modified by tests without reset
67
+ - Database/file state not cleaned up between tests
68
+ - Tests that pass individually but fail when run together (or vice versa)
69
+
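+ A hedged sketch of the leak and its isolation fix; `configCache` stands in for any module-level mutable state shared between tests:
+
+ ```ts
+ // Symptom: this suite passes alone but fails after other suites have warmed the cache
+ import { configCache, getConfig } from '../src/config'; // hypothetical module with shared state
+
+ describe('getConfig', () => {
+   // Fix: reset the shared state so every test starts from the same baseline
+   beforeEach(() => {
+     configCache.clear();
+   });
+
+   it('falls back to the default when no override is cached', () => {
+     expect(getConfig('timeoutMs')).toBe(5000); // would observe another test's override without the reset
+   });
+ });
+ ```
+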
70
+ ### Duplication & Bloat
71
+ - Near-identical tests with one parameter changed (should be parameterized/table-driven)
72
+ - The same setup code copy-pasted across 10+ test files
73
+ - Test helper functions that duplicate production code instead of calling it
74
+ - Tests for trivially simple code (getters, one-line pass-throughs) consuming maintenance effort
75
+
76
+ ### Test Helper Bugs
77
+ - Shared test utilities, factories, or builders — read them as carefully as production code
78
+ - Helpers that silently swallow errors, supply wrong defaults, or produce invalid test data
79
+ - Factory functions that don't match current schema (added fields missing, removed fields still present)
80
+
81
+ ---
82
+
83
+ ## Phase 3: Regression Effectiveness Assessment
84
+
85
+ **For each major module**, answer: "If a developer introduced a subtle bug in this module tomorrow, would the tests catch it?"
86
+
87
+ - Assess whether tests check *behavior* (given X input, expect Y output) or just *execution* (the function ran without crashing)
88
+ - Cross-reference with mutation testing results if a Test Coverage run exists — functions with high line coverage but low mutation scores are the worst offenders
89
+ - Identify the most dangerous gaps: code that has tests but whose tests wouldn't catch common bug types (off-by-one, null handling, boundary conditions, wrong status codes)
90
+
91
+ **Rate each module:** Strong (would catch most regressions) / Weak (covers happy path only) / Decorative (tests exist but catch almost nothing) / None
92
+
93
+ ---
94
+
95
+ ## Phase 4: Structural Assessment
96
+
97
+ - **Test organization**: Consistent? Discoverable? Can you find the tests for a given module without searching?
98
+ - **Test naming conventions**: Descriptive (`should return 404 when user not found`) or vague (`test1`, `works correctly`)?
99
+ - **Setup/teardown patterns**: Consistent? Appropriate scope? Unnecessarily broad?
100
+ - **Custom matchers/utilities**: Well-maintained? Documented? Actually used?
101
+ - **Test configuration**: Reasonable timeouts? Appropriate parallelism? Sensible defaults?
102
+
103
+ ---
104
+
105
+ ## Output
106
+
107
+ Save as `audit-reports/04_TEST_ARCHITECTURE_REPORT_[run-number]_[date]_[time in user's local time].md`.
108
+
109
+ ### Report Structure
110
+
111
+ 1. **Executive Summary** — Suite health rating (decorative / fragile / adequate / strong / excellent), antipattern count by category, regression effectiveness score, one-line verdict: "This test suite [would / would not] catch a subtle billing bug introduced on a Friday afternoon."
112
+ 2. **Test Inventory** — Classification table, pyramid ratio, coverage distribution.
113
+ 3. **Antipattern Findings** — Per category: count, worst examples (file + test name + what's wrong), fix pattern. Table: | File | Test | Antipattern | Severity | Suggested Fix |
114
+ 4. **Regression Effectiveness** — Per-module rating table: | Module | Test Count | Coverage | Effectiveness Rating | Why |
115
+ 5. **Structural Assessment** — Organization, naming, conventions, configuration.
116
+ 6. **Recommendations** — Priority-ordered. Focus on: which antipatterns to fix first (highest regression risk), which modules need test rewrites vs. additions, conventions to adopt, and whether the team should invest in better test infrastructure (factories, custom matchers, test database management).
117
+
118
+ ## Chat Output Requirement
119
+
120
+ Print a summary in conversation:
121
+
122
+ 1. **Status Line** — What you analyzed.
123
+ 2. **Key Findings** — Specific bullets with severity. "42 tests have zero assertions — they pass even if the code returns garbage." Not "found some test quality issues."
124
+ 3. **Recommendations** table (if warranted):
125
+
126
+ | # | Recommendation | Impact | Risk if Ignored | Worth Doing? | Details |
127
+ |---|---|---|---|---|---|
128
+ | | ≤10 words | What improves | Low–Critical | Yes/Probably/If time | 1–3 sentences |
129
+
130
+ 4. **Report Location** — Full path.
@@ -0,0 +1,165 @@
1
+ # Test Consolidation
2
+
3
+ You are running an overnight test consolidation pass. Your job: find every test that is testing the same behavioral path as another test, eliminate the duplicates, and leave the suite smaller, clearer, and no less correct. You are not improving tests — you are removing noise so the real coverage is visible.
4
+
5
+ Work on branch `test-consolidation-[date]`.
6
+
7
+ ---
8
+
9
+ ## Global Rules
10
+
11
+ - Run the full test suite before touching anything. Establish a green baseline. If it's already red, stop and document that as a CRITICAL finding — do not consolidate a broken suite.
12
+ - Run the full test suite after every consolidation. If tests go red, revert the entire change immediately and document it.
13
+ - **Never consolidate tests that cover distinct behavioral sub-cases**, even if they look similar. Two tests with different inputs are only redundant if the behavior under test is identical for both inputs.
14
+ - A merged test must be at least as expressive as the originals. If consolidation makes the intent less clear, don't do it.
15
+ - When in doubt about whether two tests are truly redundant, they are not. Document them instead.
16
+ - Make small, atomic commits — one logical consolidation per commit. Commit format: `test: consolidate duplicate [description] tests in [file]`
17
+ - You have all night. Accuracy matters more than speed.
18
+
19
+ ---
20
+
21
+ ## Phase 1: Baseline & Inventory
22
+
23
+ **Step 1: Establish baseline**
24
+ Run the full test suite. Record: total test count, pass/fail/skip counts, coverage percentage if available. If any tests are failing, stop and document before proceeding.
25
+
26
+ **Step 2: Catalog every test file**
27
+ For each file, record: path, framework, test count, and the module it covers. Note files that appear to test the same module from different angles.
28
+
29
+ ---
30
+
31
+ ## Phase 2: Duplicate Detection
32
+
33
+ Work through these categories systematically. For each duplicate group found, document before touching anything.
34
+
35
+ ### Category 1: Verbatim and near-verbatim duplicates
36
+ - Tests with identical or near-identical bodies under different names
37
+ - Tests that differ only in local variable names with no effect on what's asserted
38
+ - Copy-paste tests where setup and assertions are structurally identical
39
+ - Tests in different files that exercise the exact same code path with the exact same inputs and assert the exact same outputs
40
+
41
+ ### Category 2: Redundant happy-path saturation
42
+ Find feature areas where multiple tests all exercise the happy path with no meaningful variation — valid inputs, expected outputs, no error conditions, no edge cases. Five tests confirming the same function returns the correct value for five slightly different valid inputs are not coverage breadth; they are noise. Flag every cluster where:
43
+ - All tests use valid inputs of the same category
44
+ - No test in the group covers a different outcome or code branch
45
+ - Removing all but one would leave identical behavioral coverage
46
+
47
+ ### Category 3: Parameterizable tests
48
+ Tests that are not verbatim duplicates but differ only in input/output values and would be more clearly expressed as a single parameterized test (`it.each` / `test.each` / `@pytest.mark.parametrize` / equivalent). These are not duplicates to delete — they are duplicates to merge into a table-driven test that is more readable and easier to extend.
49
+
50
+ Candidates: test blocks that share the same `describe`, have parallel structure, and differ only in their data.
51
+
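+ A hedged sketch of the merge in Jest; `calculateSalesTax` and the figures are invented, but the shape shows how each row stays self-documenting:
+
+ ```ts
+ // Before: several copy-pasted tests differing only in input and expected output
+ // After: one table-driven test covering the same behavioral cases
+ it.each([
+   { label: 'standard order', amount: 100,       expected: 8.25 },
+   { label: 'zero amount',    amount: 0,         expected: 0 },
+   { label: 'large order',    amount: 1_000_000, expected: 82_500 },
+ ])('calculates sales tax for a $label', ({ amount, expected }) => {
+   expect(calculateSalesTax(amount)).toBeCloseTo(expected);
+ });
+ ```
+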
52
+ ### Category 4: Redundant cross-layer testing
53
+ Find cases where the exact same specific assertion is verified at multiple test layers — unit, integration, and E2E all asserting the identical return value or behavior. Note: this is not always bad. Cross-layer tests catch different failure modes. Only flag as redundant when the tests are truly checking the same thing at the same fidelity and neither layer catches a failure mode the other would miss.
54
+
55
+ ---
56
+
57
+ ## Phase 3: Build the Consolidation Map
58
+
59
+ Before changing anything, produce a complete consolidation plan:
60
+
61
+ | Group | Files | Test Count | What They All Test | Proposed Action | Tests After | Risk Level |
62
+ |---|---|---|---|---|---|---|
63
+
64
+ **Proposed actions:**
65
+ - **Delete** — Remove all but the best-named instance (verbatim duplicates)
66
+ - **Parameterize** — Merge into a single `it.each` / data-driven test
67
+ - **Merge into describe** — Consolidate into one well-structured describe block
68
+ - **Leave** — Not redundant on closer inspection; document why
69
+
70
+ Review this plan in its entirety before executing. The plan is your contract — don't deviate during execution.
71
+
72
+ ---
73
+
74
+ ## Phase 4: Execute Consolidations
75
+
76
+ Work through the plan one group at a time.
77
+
78
+ **For each consolidation:**
79
+
80
+ 1. Re-read every test in the group to confirm your earlier analysis still holds
81
+ 2. Make the change (delete, parameterize, or restructure)
82
+ 3. Run the full test suite
83
+ 4. If green: commit with a clear message listing what was removed/merged and why
84
+ 5. If red: revert the entire change immediately, document what happened, move to the next group
85
+
86
+ **Parameterization standards:**
87
+ - Each row in the parameter table must be self-documenting — a reader should understand what case it covers without reading the test body
88
+ - Use descriptive case labels, not `case1` / `case2`
89
+ - The parameterized test must cover every case the originals covered — count assertions before and after
90
+
91
+ **Deletion standards:**
92
+ - Keep the test with the most descriptive name
93
+ - If the tests have different names that each capture something, write a new name that captures both before deleting
94
+ - Never delete a test that has a comment explaining non-obvious behavior — preserve that comment
95
+
96
+ ---
97
+
98
+ ## Phase 5: Post-Consolidation Validation
99
+
100
+ After all consolidations are complete:
101
+
102
+ 1. Run the full test suite one final time
103
+ 2. Record: new total test count, pass/fail/skip counts, coverage percentage
104
+ 3. Compare before/after: tests removed, tests parameterized, net change in coverage (should be zero or improved)
105
+ 4. Run the linter if the project has one — consolidated files may need formatting fixes
106
+
107
+ ---
108
+
109
+ ## Output
110
+
111
+ Create `audit-reports/` in project root if needed. Save as `audit-reports/05_TEST_CONSOLIDATION_REPORT_[run-number]_[date]_[time in user's local time].md`, incrementing run number based on existing reports.
112
+
113
+ ### Report Structure
114
+
115
+ 1. **Executive Summary** — Baseline test count, final test count, tests removed, tests parameterized, coverage before/after, all tests passing.
116
+
117
+ 2. **Consolidation Map** — The full plan table from Phase 3, updated with actual outcomes (executed / skipped / reverted).
118
+
119
+ 3. **Consolidations Executed** — For each: what was merged, how many tests removed, commit hash, tests passing after.
120
+
121
+ 4. **Consolidations Reverted** — What was attempted, what broke, why it couldn't be resolved safely.
122
+
123
+ 5. **Consolidations Identified but Not Executed** — Groups that were flagged but left alone: why (ambiguous intent, no safe merge, time constraints).
124
+
125
+ 6. **Remaining Redundancy** — Areas of the suite that still have high happy-path saturation, no adversarial coverage, or cross-layer redundancy — flagged here for the Test Quality & Adversarial Coverage run to address.
126
+
127
+ 7. **Recommendations** — Conventions to adopt to prevent re-accumulation (parameterized test patterns, code review checklist items, file organization suggestions).
128
+
129
+ ---
130
+
131
+ ## Chat Output Requirement
132
+
133
+ In addition to writing the full report file, you MUST print a summary directly in the conversation when you finish.
134
+
135
+ ### 1. Status Line
136
+ One sentence: what you did, how long it took, whether all tests still pass, and the before/after test count.
137
+
138
+ ### 2. Key Findings
139
+ The most important things discovered — specific and actionable.
140
+
141
+ **Good:** "Found 34 verbatim duplicate tests across `user.test.ts` and `user.service.test.ts` — all testing the same happy-path token validation with different variable names."
142
+ **Bad:** "Found some duplicate tests."
143
+
144
+ ### 3. Changes Made
145
+ Bullet list of consolidations executed. Skip if nothing was changed.
146
+
147
+ ### 4. Recommendations
148
+
149
+ If there are legitimately beneficial recommendations worth pursuing right now, present them in a table. Do **not** force recommendations — if the audit surfaced no actionable improvements, simply state that no recommendations are warranted at this time and move on.
150
+
151
+ | # | Recommendation | Impact | Risk if Ignored | Worth Doing? | Details |
152
+ |---|---|---|---|---|---|
153
+ | *Sequential number* | *Short description (≤10 words)* | *What improves if addressed* | *Low / Medium / High / Critical* | *Yes / Probably / Only if time allows* | *1–3 sentences explaining the reasoning, context, or implementation guidance* |
154
+
155
+ Order rows by risk descending. Be honest in "Worth Doing?" — not everything flagged is worth the engineering time.
156
+
157
+ ### 5. Report Location
158
+ State the full path to the detailed report file for deeper review.
159
+
160
+ ---
161
+
162
+ **Formatting rules for chat output:**
163
+ - Use markdown headers, bold for severity labels, and bullet points for scannability.
164
+ - Do not duplicate the full report contents — just the highlights and recommendations.
165
+ - If you made zero findings in a phase, say so in one line rather than omitting it silently.