PyPI - sumo-qa - Versions diffs - 0.1.0__py3-none-any.whl - Mend

sumo-qa 0.1.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (45) hide show

sumo_qa/__init__.py +3 -0
sumo_qa/__main__.py +4 -0
sumo_qa/_data/knowledge/approaches.md +46 -0
sumo_qa/_data/knowledge/classifications.md +54 -0
sumo_qa/_data/knowledge/principles.md +79 -0
sumo_qa/_data/knowledge/specialty_tools.md +54 -0
sumo_qa/_data/knowledge/techniques.md +87 -0
sumo_qa/_data/knowledge/test_data/auth/sample_accounts.yaml +38 -0
sumo_qa/_data/knowledge/test_data/billing/sample_invoices.yaml +38 -0
sumo_qa/_data/skills/qa-answering-testing-question/SKILL.md +82 -0
sumo_qa/_data/skills/qa-creating-test-plan/SKILL.md +114 -0
sumo_qa/_data/skills/qa-deciding-approach/SKILL.md +104 -0
sumo_qa/_data/skills/qa-executing-qa-rollout/SKILL.md +112 -0
sumo_qa/_data/skills/qa-executing-qa-rollout/prompts/implementer-prompt.md +60 -0
sumo_qa/_data/skills/qa-executing-qa-rollout/prompts/quality-reviewer-prompt.md +56 -0
sumo_qa/_data/skills/qa-executing-qa-rollout/prompts/spec-reviewer-prompt.md +51 -0
sumo_qa/_data/skills/qa-finding-test-data/SKILL.md +97 -0
sumo_qa/_data/skills/qa-finishing-qa-work/SKILL.md +134 -0
sumo_qa/_data/skills/qa-implementing-with-tdd/SKILL.md +116 -0
sumo_qa/_data/skills/qa-planning-qa-rollout/SKILL.md +160 -0
sumo_qa/_data/skills/qa-preparing-for-work/SKILL.md +87 -0
sumo_qa/_data/skills/qa-reviewing-before-merge/SKILL.md +115 -0
sumo_qa/_data/skills/qa-strengthening-tests/SKILL.md +116 -0
sumo_qa/_data/skills/sumo-qa-strategising/SKILL.md +119 -0
sumo_qa/_data/skills/using-sumo-qa/SKILL.md +140 -0
sumo_qa/_data/standards/packs/istqb_v1.yml +135 -0
sumo_qa/_data/standards/packs/qa_shift_left_v1.yml +51 -0
sumo_qa/_data/standards/rules/change_rules.yaml +237 -0
sumo_qa/debug_capture.py +51 -0
sumo_qa/knowledge_loaders.py +138 -0
sumo_qa/rules.py +130 -0
sumo_qa/server.py +316 -0
sumo_qa/skill_prompts.py +119 -0
sumo_qa/standards.py +127 -0
sumo_qa/tdm_catalogue.py +179 -0
sumo_qa/tdm_models.py +121 -0
sumo_qa/tdm_service.py +378 -0
sumo_qa/tdm_validation.py +151 -0
sumo_qa/tools.py +174 -0
sumo_qa-0.1.0.dist-info/METADATA +125 -0
sumo_qa-0.1.0.dist-info/RECORD +45 -0
sumo_qa-0.1.0.dist-info/WHEEL +4 -0
sumo_qa-0.1.0.dist-info/entry_points.txt +2 -0
sumo_qa-0.1.0.dist-info/licenses/LICENSE +202 -0
sumo_qa-0.1.0.dist-info/licenses/NOTICE +3 -0

sumo_qa/__init__.py ADDED Viewed

@@ -0,0 +1,3 @@
+"""QA Shift-Left MCP server package."""
+__version__ = "0.1.0"

sumo_qa/__main__.py ADDED Viewed

@@ -0,0 +1,4 @@
+from sumo_qa.server import main
+if __name__ == "__main__":
+    main()

sumo_qa/_data/knowledge/approaches.md ADDED Viewed

@@ -0,0 +1,46 @@
+# Canonical QA approaches
+Eight canonical approaches the host LLM picks from when deciding the shape
+of QA work. The LLM may invent a new approach if the situation genuinely
+needs one, but it must explain why none of these eight fit.
+## strategy-orchestration
+Repo-wide / policy-shaped ask: "design a test strategy", "audit our coverage",
+"design our pyramid", "rollout to other services", "minimum viable QA setup".
+Do NOT force per-change output. Next step is loading the
+`sumo-qa-strategising` skill.
+## tdd-scaffold
+Greenfield-ish change adding behaviour. Plan -> scaffold -> red ->
+implement -> green. Fits when production code is being written for new
+functionality.
+## regression-first
+Bug fix on existing code. Reproduce the defect as one failing test, fix it,
+confirm green, run targeted regression. Fits when the user describes a
+specific defect on existing behaviour.
+## coverage-first-then-refactor
+Behaviour-preserving refactor (rename, extract method, split module, restructure
+without changing outputs). Audit existing coverage and add characterization
+tests BEFORE refactoring. Tests pin current behaviour so the refactor doesn't
+silently change outputs.
+## strengthen-test-coverage
+Strengthen existing tests on UNCHANGED production code. Mutation-testing
+follow-up, raise-coverage tasks, killing weak assertions. Production code
+stays still. Equivalent mutants get suppressed in tool config rather than
+chased with tautological tests.
+## verify-existing
+Config-only or trivial tweak that doesn't merit new tests. Run the existing
+suite plus a smoke test. Fits when a config bump or mechanical edit needs
+confirmation, not new coverage.
+## no-tests-recommended
+Pure docs / typos / comments. Build + lint, no QA test work. The honest
+senior-QA answer when the change has no behavioural surface.
+## spike-first-then-tests
+Exploratory prototype. Defer test discipline until the design settles. The
+deliverable is a captured-conditions-and-fit record, not scaffolded tests.

sumo_qa/_data/knowledge/classifications.md ADDED Viewed

@@ -0,0 +1,54 @@
+# Canonical change classifications
+Ten canonical classifications used to shape testing strategy. The host LLM picks
+which apply to a given change by reasoning over the user's intent and target
+paths. The catalogue below is authoritative — do not invent classifications
+not in this list.
+## api_contract_change
+A change that adds, removes, or modifies a public API surface (HTTP endpoint,
+gRPC method, public library function, event schema). Risk: downstream
+consumers break on signature drift.
+## business_logic_change
+A change to domain rules, calculations, decision logic, or state machines.
+Risk: incorrect outcomes for valid inputs.
+## security_change
+A change touching authentication, authorisation, secrets handling, encryption,
+input sanitisation, rate limiting, audit logging. Risk: privilege escalation,
+data leak, regression of a security control.
+## performance_change
+A change motivated by latency, throughput, memory, or resource consumption,
+including caching, batching, query plan changes, and indexing.
+Risk: regression in p99 or memory profile under load.
+## frontend_change
+A change to UI components, page layout, accessibility tree, client-side
+interaction, or rendering. Risk: visual / interaction regressions, a11y
+regressions.
+## infrastructure_change
+A change to deployment, IaC, runtime configuration, networking, or platform-
+level concerns (Kubernetes manifests, Terraform, Docker, CI). Risk:
+environment-only failures invisible in unit tests.
+## test_change
+A change exclusively to test code or test fixtures, with no production code
+movement. Includes mutation-testing follow-up, raise-coverage tasks,
+strengthening weak assertions, and refactoring tests. Risk: false confidence
+if tests become tautological.
+## docs_change
+A change to documentation, comments, README, or any non-executable artefact.
+Risk: minimal — typically no QA test work needed beyond build/lint.
+## config_change
+A change to configuration files (YAML, JSON, env files, feature flags) where
+the configuration is consumed at runtime by existing code paths. Risk:
+behaviour drift via the config without code review catching it.
+## data_migration
+A change that transforms persisted data (schema migration, backfill, ETL).
+Risk: data loss, broken referential integrity, partial migrations.

sumo_qa/_data/knowledge/principles.md ADDED Viewed

@@ -0,0 +1,79 @@
+# QA principles — ISTQB Foundation, Advanced, ISO 25010
+## ISTQB Foundation — the seven testing principles
+1. Testing shows the presence of defects, not their absence.
+2. Exhaustive testing is impossible; use risk and prioritisation.
+3. Early testing saves time and money — shift left.
+4. Defects cluster — concentrate effort where defect history is dense.
+5. Pesticide paradox — the same tests stop finding defects; refresh
+   assertions and add new techniques.
+6. Testing is context-dependent — safety-critical, regulated, web,
+   mobile, AI all warrant different mixes.
+7. Absence-of-errors fallacy — validate fitness for use, not just
+   code-level correctness.
+## ISTQB Advanced
+- **Test Manager:** risk-based testing (likelihood x impact), shaping
+  coverage to where risk is highest, accepting low-risk areas with
+  thinner tests, entry/exit criteria, test estimation.
+- **Test Analyst:** black-box / experience-based technique mastery,
+  tester independence, defect taxonomy.
+- **Technical Test Analyst:** white-box / structural coverage, code
+  analysis, performance / security / reliability test design.
+## ISO/IEC 25010 quality characteristics
+- functional suitability
+- performance efficiency
+- compatibility
+- usability
+- reliability
+- security
+- maintainability
+- portability
+Pick the characteristics the change actually threatens; do not list them all.
+## Test levels and the pyramid
+unit -> component integration -> system -> system integration -> acceptance.
+Shape the mix to the risk and the change.
+## Test types (orthogonal to levels)
+- functional
+- non-functional (performance, security, accessibility, reliability,
+  compatibility, usability)
+- white-box / structural
+- change-related (confirmation + regression)
+## Static testing
+Review (informal walkthrough, technical review, inspection) and static
+analysis (linters, type checkers, SAST). Often the cheapest defect removal.
+A code review or a stricter linter rule can be the right answer instead of
+"add more tests".
+## Senior-QA disciplines
+- Decide the SHAPE of the work first (single change vs repo-wide strategy;
+  bug vs greenfield vs refactor vs strengthen-existing-tests vs spike vs
+  config tweak vs docs). Wrong shape = wrong-shaped tests.
+- Reach for the smallest useful test set that gives release confidence.
+  Avoid generic advice; tie every recommendation to a specific risk.
+- When the user asks about strategy / audit / pyramid / rollout, that is a
+  strategy ask, not a single change. Don't force per-change output.
+- When the user describes work that doesn't change production code (mutation-
+  testing follow-up, raise-coverage, kill surviving mutants, tighten weak
+  assertions), do NOT scaffold tests against new behaviour. Strengthen
+  existing tests. Suppress equivalent mutants in tool config rather than
+  chasing them.
+- Critical paths (auth, authorization, payment, billing, encryption, rate
+  limiting, anything where a regression hits money, security, or customer
+  trust) warrant tighter coverage and at least one boundary test per rule.
+- Honest TDD: red phase first. Tests that fail BEFORE production code is
+  written. Never bless a change as merge-ready without evidence.
+- Static testing counts. A code review or a stricter linter rule can be
+  the right answer instead of "add more tests".

sumo_qa/_data/knowledge/specialty_tools.md ADDED Viewed

@@ -0,0 +1,54 @@
+# Specialty + tool fit — category primer
+Category-fit primer, **NOT** a brand whitelist. Each section says WHEN that
+specialty category applies. Brand names are illustrative — pick the best fit
+from your knowledge of the user's stack, verify currency via web search when
+uncertain. Once chosen, **install and set the tool up** (package manager,
+framework CLI, config edit, or MCP server — whichever path is shortest) and
+write the first tests against the named risks. Empty selection is acceptable.
+---
+**Token TTL / signature / claim validation** — JWT/JOSE/OAuth, signature
+verify, claims, expiry. Examples: JJWT, Auth0 java-jwt, jose4j.
+**HTTP DAST — new endpoint / auth filter** — header/CORS/auth bypasses on a
+real HTTP surface (not in-process pure functions). Examples: OWASP ZAP, Burp Suite.
+**Static security analysis** — hard-coded secrets, `alg=none`, SQLi, unsafe
+deserialisation. Examples: Semgrep, Snyk, SonarQube.
+**REST contract drift** — HTTP service with external consumers. Examples:
+Pact (consumer-driven), Spring Cloud Contract, Schemathesis (OpenAPI fuzzing).
+**Async / event contract drift** — handlers consuming events whose schemas
+may drift. Examples: Schemathesis, AsyncAPI test runners.
+**Frontend visual / interaction** — UI needing end-to-end browser coverage.
+Examples: Cypress, Playwright. (MCP servers exist for some; package-manager
+install is usually shorter.)
+**Frontend a11y** — keyboard nav, screen readers, ARIA, contrast. Examples:
+axe-core (often via Playwright), Pa11y.
+**Mobile UI** — mobile app surface. Examples: Appium, Maestro, Detox, XCUITest, Espresso.
+**Performance / load** — hot path with articulated SLO (p95, RPS). Without a
+budget it's theatre. Examples: k6, Locust, Gatling, JMeter.
+**Mutation testing** — coverage looks good but assertion strength suspect.
+Examples: Pitest (JVM), Stryker (JS/TS/.NET/Scala), MutPy / mutmut (Python).
+**Property-based** — invariant across many inputs (commutativity, idempotency,
+round-trip, monotonicity). Examples: Hypothesis (Python), jqwik (JVM),
+fast-check (JS/TS), ScalaCheck.
+**AI / LLM behaviour** — probabilistic surfaces: prompts, RAG, agents.
+Examples: Promptfoo, DeepEval, Ragas, TruLens, Evidently.
+---
+Discipline: pick by fit, not familiarity. Verify currency before naming.
+Once chosen, set the tool up yourself (install, config, scaffold the first
+tests) — don't hand the user a list of commands. Confirm before installing
+dependencies. Empty selection is honest; most changes don't need this.

sumo_qa/_data/knowledge/techniques.md ADDED Viewed

@@ -0,0 +1,87 @@
+# Test design techniques
+Pick one technique per named risk. The catalogue below is authoritative —
+do not invent techniques not in this list. If a risk needs something not
+catalogued, flag it as a gap rather than confabulating.
+## Black-box
+### equivalence partitioning
+Group inputs into classes that share behaviour; pick one representative
+per class. Use when the input space is large but partitions clearly.
+### boundary value analysis
+Test the values immediately above, at, and below boundaries (off-by-one,
+limits, capacity thresholds). Defects cluster at boundaries.
+### decision tables
+Enumerate every combination of input conditions and the expected output.
+Use when business rules are conjunctions of multiple conditions.
+### state transition testing
+Model the system as a finite state machine; test every legal transition
+and a sample of illegal ones. Use for stateful components.
+### pairwise / orthogonal arrays
+When a feature has many parameters with many values each, test every pair
+of value combinations rather than the full Cartesian product.
+### classification trees
+Hierarchical refinement of equivalence partitions, useful when partitions
+have sub-partitions.
+### use case testing
+Walk through end-to-end user scenarios. Catches integration defects unit
+tests miss.
+## White-box / structural
+### statement coverage
+Every statement executes at least once.
+### branch coverage
+Every branch (if/else, switch arm, loop entry/exit) is exercised both ways.
+### decision coverage
+Every boolean decision evaluates to both true and false.
+### MC-DC (modified condition / decision coverage)
+Each condition independently affects the decision outcome. Required for
+safety-critical software.
+### data-flow coverage
+Every definition-use pair is exercised.
+## Experience-based
+### error guessing
+Senior-QA judgment on where defects historically appear in similar systems.
+### exploratory testing charters
+Time-boxed, mission-driven sessions documenting findings.
+### checklist-based testing
+Reusable list of common pitfalls (e.g. OWASP Top 10).
+## Static
+### review (walkthrough / technical review / inspection)
+Code review with varying formality. Cheapest defect removal.
+### static analysis
+Linters, type checkers, SAST scanners. Catches whole classes of defects
+without execution.
+## Property-based
+### property-based testing
+Generate inputs satisfying invariants and check the output preserves the
+invariant. Tools: Hypothesis (Python), jqwik (JVM), fast-check (JS).
+Fits pure-function refactors and algorithms.
+## Mutation
+### mutation testing
+Mutate the production code, run the test suite, kill the mutants. Surfaces
+weak assertions. Tools: Pitest (JVM), Stryker (JS / .NET), mutmut (Python).
+Fits strengthen-test-coverage approach.

sumo_qa/_data/knowledge/test_data/auth/sample_accounts.yaml ADDED Viewed

@@ -0,0 +1,38 @@
+entries:
+  - id: auth-mfa-required-001
+    environment: integration
+    domain: auth
+    scenario_tags:
+      - mfa_required
+      - active_account
+    known_valid_for:
+      - mfa enforcement testing
+      - active account login
+    constraints:
+      - Reset MFA state after test execution.
+      - Revalidate before release testing if MFA policy changed.
+    owner: identity-platform
+    last_validated_at: "2026-05-06T09:00:00Z"
+    confidence: high
+    source: qa-curated
+    validation_source: mock-heuristic-validator
+    notes: Active account with enforced MFA for login-flow testing in integration.
+  - id: auth-locked-account-001
+    environment: integration
+    domain: auth
+    scenario_tags:
+      - account_locked
+      - failed_login
+    known_valid_for:
+      - locked account rejection
+      - failed-login lockout testing
+    constraints:
+      - Use a fresh impersonation token for each test run.
+      - Expected failure must be account-locked, not invalid credentials.
+    owner: identity-platform
+    last_validated_at: "2026-05-04T11:30:00Z"
+    confidence: high
+    source: qa-curated
+    validation_source: mock-heuristic-validator
+    notes: Useful for validating that locked-account rejection is not conflated with invalid credentials.

sumo_qa/_data/knowledge/test_data/billing/sample_invoices.yaml ADDED Viewed

@@ -0,0 +1,38 @@
+entries:
+  - id: billing-paid-invoice-001
+    environment: integration
+    domain: billing
+    scenario_tags:
+      - invoice_paid
+      - refund_eligible
+    known_valid_for:
+      - paid invoice refund flow
+      - invoice state validation
+    constraints:
+      - Reset payment state before re-run.
+      - Release any holds after test execution.
+    owner: billing-platform
+    last_validated_at: "2026-05-05T15:15:00Z"
+    confidence: high
+    source: qa-curated
+    validation_source: mock-heuristic-validator
+    notes: Known-good paid invoice for refund-flow testing in integration.
+  - id: billing-pending-due-boundary-001
+    environment: integration
+    domain: billing
+    scenario_tags:
+      - invoice_pending
+      - boundary_due_date
+    known_valid_for:
+      - pending invoice due-date boundary
+      - boundary-value validation
+    constraints:
+      - Confirm clock skew is within tolerance before release sign-off.
+      - Avoid if promotion testing is not intended.
+    owner: billing-platform
+    last_validated_at: "2026-04-20T10:00:00Z"
+    confidence: medium
+    source: qa-curated
+    validation_source: mock-heuristic-validator
+    notes: Stable pending invoice for due-date boundary checks; also covers boundary-value technique.

sumo_qa/_data/skills/qa-answering-testing-question/SKILL.md ADDED Viewed

@@ -0,0 +1,82 @@
+---
+name: qa-answering-testing-question
+description: Use when the user asks a generic testing question — "how do I test this?", "what should I check for X?" — that doesn't fit a more specific QA skill. Cites a principle or technique from the loaded catalogue rather than producing generic advice.
+---
+# Answering a testing question
+**Announce at start:** *"Answering with a cited principle and technique."*
+## Output discipline (mandatory)
+**Never surface internal taxonomy labels in user-facing output.** No "Classification: X", "Approach: Y", "Per the checklist", "Step 3 of 6". The taxonomy is internal scaffolding; translate to natural English when the meaning matters to the user — *"this is a behaviour change in pricing"*, not *"Classification: business_logic_change"*. If you catch yourself typing a label, delete it.
+## Output economy (mandatory)
+Spend output tokens on findings, not framing.
+- **Don't preamble the work.** The host already shows tool calls — present findings, don't narrate *"I'll first read X, then Y, then deliver Z."*
+- **One question per turn.** Don't follow a question with *"shall I proceed or clarify first?"* — the question IS the gate.
+- **No self-narration.** *"Let me now..."* / *"I'm going to..."* → just do it.
+- **Don't restate the user's input.** They know what they asked.
+- **Section headings only when there are genuinely multiple sections.** A 3-line scope check doesn't need a `## Scope` heading.
+- **Tables only when comparing >2 things on >2 axes.** Otherwise prose is shorter.
+- **No closing pleasantries.** No *"happy to dig deeper"* / *"let me know if you want X"* — the next-skill handoff at the bottom of every skill is where routing lives.
+## The Iron Law
+NO ANSWER WITHOUT A CITED PRINCIPLE OR TECHNIQUE.
+Every answer ties to a named ISTQB principle and a named test design technique from the loaded catalogue; specialty tools come from your ecosystem knowledge, with `specialty_tools.md` confirming the category fits.
+## When to Use
+`qa-deciding-approach` routes here when the user's intent is question-shaped but doesn't fit a more specific skill:
+- "how do I test this service?"
+- "what should I check for X feature?"
+- "any QA suggestions for this design?"
+- "what's the right test type for this?"
+For "create a plan" / "prep for work" / "review my changes" → use the more specific skills.
+## Checklist
+You MUST create a TodoWrite item per checklist item and complete in order:
+1. Read the user's question verbatim.
+2. Read any code/paths/specs the user supplied (host's file tools).
+3. Call `sumo_qa_load_principles()` and `sumo_qa_load_techniques()`. Read both catalogues.
+4. Identify the QA shape the question implies: what's the actual concern (correctness / regression / coverage / risk surface)?
+5. Pick at least one principle that shapes the answer (cite by number or name). Pick at least one technique that fits the concern.
+6. If the question implies a specialty surface (security, performance, contract, a11y, UI, etc.), recommend the best-fit tool from your knowledge of the ecosystem anchored to the user's stack. `sumo_qa_load_specialty_tools()` is a category-fit primer, NOT a brand whitelist. Verify currency with web search if uncertain. Offer to install and scaffold the first tests; confirm before installing dependencies.
+7. Synthesise the answer: 3-7 sentences, naming the principle/technique/tool. Conversational, not a JSON blob.
+8. If the question is actually a prep/plan/review/strategy in disguise, escalate: stop, route to the matching skill.
+## Process Flow
+See the Checklist above — that's the flow.
+## Red Flags
+| Thought | Reality |
+|---|---|
+| "Just say 'add unit tests and integration tests'" | Generic. Pick a technique from the catalogue (boundary value, decision table, etc.). |
+| "Mention security as a consideration" | Name the actual surface AND the right tool for it (HTTP DAST scanner / SAST tool / token-validation harness — pick from your knowledge by fit). Bare "consider security" is not senior-QA. |
+| "I'll cite a principle by paraphrasing — saves loading the catalogue" | Principles are catalogue-authoritative. Use the catalogue's wording. (Tool brand picks are different — those come from your knowledge of the ecosystem.) |
+| "I'll restrict tool recommendations to the names in `specialty_tools.md`" | The primer is a category check, not a brand whitelist. Recommend the best fit from your knowledge; the listed names are illustrative. |
+| "User asked a planning question — I'll answer inline" | Route to `qa-preparing-for-work` or `qa-creating-test-plan`. Don't reinvent. |
+| "Answer should be 20+ sentences for completeness" | 3-7 sentences. Senior QA answers concisely. |
+## Examples
+### Good
+User: "how should I test a new feature that re-orders user feeds?"
+Answer cites ISTQB Principle 4 (defects cluster — feed ordering is a hotspot), names decision-table for the ordering rules and equivalence-partitioning for feed sizes, suggests k6 if scale matters, and asks the user to confirm scale before adding performance work.
+### Bad
+"You should add unit tests, integration tests, and consider edge cases. Maybe test performance too." — no cited principle, no named technique, no specialty tool named by fit.
+## Next skill in the chain
+Terminal skill — the answer is the deliverable. If the question turns out to be a disguised plan / review / strategy ask, stop and route to the matching specific skill.

sumo_qa/_data/skills/qa-creating-test-plan/SKILL.md ADDED Viewed

@@ -0,0 +1,114 @@
+---
+name: qa-creating-test-plan
+description: Use when the user asks for a formal test plan, entry/exit criteria, or a phased QA approach for a piece of work. Walk the user through scope → risks → entry criteria → phases → exit criteria → residual risks one section at a time, getting confirmation before each step. Heavier than qa-preparing-for-work; use when the work is tracked or formally reviewed.
+---
+# Creating a Test Plan
+Help the user turn a piece of upcoming work into a phased ISTQB-style test plan through natural collaborative dialogue. Walk through scope, risks, criteria, and phases one section at a time, confirming with them after each, until the full plan is on the page. The user has domain context the AI can't infer — surface it through questions, don't assume it.
+**Announce at start:** *"Building the formal test plan."*
+## Output discipline (mandatory)
+**Never surface internal taxonomy labels in user-facing output.** No "Classification: X", "Approach: Y", "Per the checklist", "Step 3 of 6". The taxonomy is internal scaffolding; translate to natural English when the meaning matters to the user — *"this is a behaviour change in pricing"*, not *"Classification: business_logic_change"*. If you catch yourself typing a label, delete it.
+Inherits the global discipline from `using-sumo-qa` (knowledge authority hierarchy, internal scaffolding stays internal, specialty-tool fit).
+## Output economy (mandatory)
+Spend output tokens on findings, not framing.
+- **Don't preamble the work.** The host already shows tool calls — present findings, don't narrate *"I'll first read X, then Y, then deliver Z."*
+- **One question per turn.** Don't follow a question with *"shall I proceed or clarify first?"* — the question IS the gate.
+- **No self-narration.** *"Let me now..."* / *"I'm going to..."* → just do it.
+- **Don't restate the user's input.** They know what they asked.
+- **Section headings only when there are genuinely multiple sections.** A 3-line scope check doesn't need a `## Scope` heading.
+- **Tables only when comparing >2 things on >2 axes.** Otherwise prose is shorter.
+- **No closing pleasantries.** No *"happy to dig deeper"* / *"let me know if you want X"* — the next-skill handoff at the bottom of every skill is where routing lives.
+<HARD-GATE>
+Do NOT emit a test plan in a single message. Walk through the sections one at a time, getting the user's confirmation or correction between each. A test plan dumped in one turn is a wishlist; a test plan built collaboratively is reviewable.
+</HARD-GATE>
+## The Iron Law
+**NO PLAN WITHOUT EXPLICIT ENTRY AND EXIT CRITERIA.** A document missing either is a wishlist, not a plan.
+## When to Use
+User intents that trigger this skill:
+- "create a test plan for X"
+- "draft the formal QA plan I should follow"
+- "give me entry/exit criteria for X"
+- "I'm starting a major feature — plan QA properly"
+Distinct from `qa-preparing-for-work` (lighter prep brief, no formal entry/exit gates) — use this when the work is tracked, formally reviewed, or large enough to warrant phased execution.
+## Checklist
+You MUST work through these in order. Steps 1–3 are AI-only homework (no user questions). The user's confirmation gates steps 4 onward.
+1. **Extract scope hints from intent** *(no user question)* — re-read the user's intent verbatim. Identify keywords / paths / domain terms that point at where the work lives.
+2. **Walk the repo for the scope** *(no user question)* — use the host's file tools. Find where the production code lives, existing tests, related callers, classification signal. Don't ask the user where things are.
+3. **Load the catalogues** *(no user question)* — call `sumo_qa_load_standards`, `sumo_qa_load_rules`, `sumo_qa_load_techniques`, `sumo_qa_load_specialty_tools`. Internal only.
+4. **Confirm scope, only for the AMBIGUOUS parts** — present a short paragraph of what you FOUND (file paths, callers, existing tests). Then ask ONE focused question for whatever the code DIDN'T make clear. If exploration left nothing ambiguous, skip the question and move to step 5.
+5. **Propose named risks (one message, ask after)** — 3–7 named risks, each anchored in evidence you actually saw (file path, class name, domain term). NOT generic. Ask: *"do these match how you'd describe the risks?"*
+6. **Pick technique per risk** — name one technique per risk from the techniques catalogue. Present as a table: risk → technique. Ask: *"do these technique choices fit?"*
+7. **Recommend specialty tools (if any), and offer to set them up** — pick the best-fit tool from your knowledge of the ecosystem anchored to the user's stack. The primer is a category check, NOT a brand whitelist. Verify currency with web search if unsure. Offer to install and scaffold the first tests against the named risks. Confirm before installing dependencies. Empty list is acceptable.
+8. **Entry criteria — what must be true to START testing** — 3–5 observable preconditions (API spec frozen, test data loaded, feature flag default off, etc.).
+9. **Phases + deliverables** — propose analysis / design / execution / completion phases with concrete deliverables per phase.
+10. **Exit criteria — what must be true to SHIP** — observable exit criteria (all named risks have ≥1 passing test, no Sev-1/2 open, perf under p95 budget). Tautologies like "tests pass" are forbidden.
+11. **Residual risks accepted at exit** — name 1–3 risks you're NOT covering and why (out of scope, accepted cost, mitigated elsewhere).
+12. **Final plan** — assemble the confirmed sections into one document. Offer to write to a file (e.g. `docs/qa-plans/<topic>.md`) or surface inline. Confirm before writing.
+## Process Flow
+See the Checklist above — that's the flow.
+## Red Flags — STOP and rework
+| Thought | Reality |
+|---|---|
+| "I'll draft the whole plan and they can react" | Iron-Law-adjacent failure. Walk through one section at a time. |
+| "I'll ask the user where the code lives instead of looking" | The code is in the repo. Read it. Ask only for what the code didn't make clear. |
+| "I'll list 4 clarifying questions in one message to be thorough" | Overwhelming. ONE primary question per turn. |
+| "Skip exit criteria — they'll know when it's done" | Then it's not a plan. Iron Law violated. |
+| "Entry criteria: 'tests are green'" | Tautology. Entry criteria are about the world before testing — feature complete, data available, environments stand up. |
+| "Add a phase called 'edge cases'" | Phases are analysis / design / execution / completion. "Edge cases" is a phase only in a junior QA's plan. |
+| "Residual risks: 'none'" | Every plan has residual risks. Naming "none" means you didn't think about what could still go wrong post-ship. |
+| "Mutation testing on a UI redesign" | Wrong tool fit. Pick from the catalogue based on the actual risk surface. |
+| "Tests cover all behaviour" | "All behaviour" is not measurable. Exit criteria must be observable. |
+| "User said create a plan — I'll just create it without asking" | The user has context you don't. Walk through the sections; their corrections shape the plan. |
+## Examples
+### Good (explore first, ask only what's ambiguous)
+> **User:** "Create a test plan for the new tax-calculation feature."
+>
+> **AI (after internal repo walk):** *"Found the tax engine at `tax/TaxEngine.kt` with jurisdiction handlers under `tax/jurisdictions/`. Called from `pricing/Calculator.kt` upstream; `invoicing/InvoiceBuilder.kt` consumes it. Existing tests cover 4 jurisdictions. **Is scope just the tax engine, or also the upstream pricing-pipeline integration?**"*
+### Bad (single-shot dump)
+> **AI:** *"Here's a plan: Phases — planning, testing, deployment. Tests — happy path, edge cases, integration. Entry — code complete. Exit — tests pass. Done."*
+>
+> Generic phases, no risks named, tautological exit, no collaboration. Iron Law violated.
+## Next skill in the chain
+When the plan is signed off → `qa-planning-qa-rollout` to break the phases into bite-sized, dispatchable tasks ready for subagent execution.
+If the user wants to act on a single phase directly rather than dispatch it → route to the matching execution skill instead (`qa-implementing-with-tdd` for new behaviour / regressions, `qa-strengthening-tests` for mutation follow-up, `qa-reviewing-before-merge` for review-shaped phases).