npm - @harness-engineering/cli - Versions diffs - 1.13.0 → 1.13.1 - Mend

@harness-engineering/cli 1.13.0 → 1.13.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (267)

package/dist/agents/skills/gemini-cli/harness-integration-test/SKILL.md ADDED

@@ -0,0 +1,271 @@
+# Harness Integration Test
+> Service boundary testing, API contract verification, and consumer-driven contract validation. Ensures services communicate correctly without requiring full end-to-end infrastructure.
+## When to Use
+- Testing API endpoints with real HTTP requests against a running service
+- Validating consumer-driven contracts between microservices (Pact, Spring Cloud Contract)
+- Verifying database interactions through repository or data access layers
+- NOT when testing pure business logic with no I/O (use unit tests or harness-tdd instead)
+- NOT when testing full user flows through a browser (use harness-e2e instead)
+- NOT when performing load or stress testing on APIs (use harness-load-testing instead)
+## Process
+### Phase 1: DISCOVER -- Map Service Boundaries and Dependencies
+1. **Identify service boundaries.** Scan the project structure for:
+   - API route definitions (Express routers, FastAPI endpoints, Spring controllers, Go HTTP handlers)
+   - Service client code (HTTP clients, gRPC stubs, message queue publishers/consumers)
+   - Shared type definitions or API schemas (OpenAPI specs, proto files, GraphQL schemas)
+2. **Map inter-service dependencies.** For each service, catalog:
+   - Upstream dependencies: services this service calls
+   - Downstream consumers: services that call this service
+   - Shared resources: databases, message queues, caches
+3. **Inventory existing integration tests.** Glob for test files in `tests/integration/`, `__integration__/`, `tests/api/`, and `contract-tests/`. Classify by type:
+   - API tests: send HTTP requests and assert responses
+   - Contract tests: verify provider/consumer agreements
+   - Repository tests: test data access against a real database
+4. **Identify coverage gaps.** Cross-reference discovered endpoints and service boundaries against existing tests. Flag untested:
+   - API endpoints with no request/response validation
+   - Service boundaries with no contract tests
+   - Error scenarios (4xx responses, timeout handling, retry behavior)
+5. **Select test strategy.** Based on the architecture:
+   - Monolith: API tests with supertest/httptest against the running application
+   - Microservices: consumer-driven contract tests with Pact plus API tests per service
+   - Event-driven: message contract tests plus async handler integration tests
+### Phase 2: MOCK -- Configure Test Doubles and Infrastructure
+1. **Set up test database.** Choose the appropriate strategy:
+   - **Testcontainers:** Spin up a real database in Docker for each test suite. Preferred for PostgreSQL, MySQL, MongoDB.
+   - **In-memory database:** SQLite in-memory for lightweight tests. Only when schema compatibility is confirmed.
+   - **Transaction rollback:** Wrap each test in a transaction and roll back. Fast but requires careful connection management.
+2. **Configure mock services for external dependencies.** For each upstream dependency:
+   - Create a mock server using the framework's built-in tools (Pact mock, WireMock, nock, MSW)
+   - Define request/response pairs from the API contract or OpenAPI spec
+   - Configure realistic error responses (500, 503, timeout) for error path testing
+3. **Set up contract broker (if using Pact).** Configure:
+   - Pact broker URL and authentication
+   - Consumer and provider version tagging strategy
+   - Webhook configuration for provider verification on deploy
+4. **Create test fixtures and seed data.** Generate:
+   - Database seed scripts for required reference data
+   - Request/response fixtures for common API payloads
+   - Factory functions for building test entities with sensible defaults
+5. **Verify mock infrastructure starts.** Run a smoke test that:
+   - Starts the test database and confirms connectivity
+   - Starts mock services and confirms they respond
+   - Seeds baseline data and confirms it is queryable
+### Phase 3: IMPLEMENT -- Write Integration Tests
+1. **Write API endpoint tests.** For each endpoint, test:
+   - Happy path: valid request returns expected response with correct status code and body
+   - Validation: invalid input returns 400 with descriptive error messages
+   - Authentication: unauthenticated requests return 401, unauthorized return 403
+   - Not found: requests for non-existent resources return 404
+   - Edge cases: empty collections, pagination boundaries, large payloads
+2. **Write consumer-driven contract tests (when applicable).** For each consumer-provider pair:
+   - Consumer side: define interactions (request/response pairs) the consumer expects
+   - Provider side: verify the provider satisfies all consumer contracts
+   - Use Pact matchers for flexible verification (type matching, regex, array-like)
+3. **Write repository/data access tests.** For each data access layer:
+   - CRUD operations with valid data
+   - Constraint violations (unique, foreign key, not-null)
+   - Query correctness (filters, sorting, pagination)
+   - Transaction behavior (isolation, rollback on error)
+4. **Write error handling and resilience tests.** Verify:
+   - Timeout behavior: service responds within SLA when dependency is slow
+   - Retry logic: transient failures trigger retries with backoff
+   - Circuit breaker: repeated failures open the circuit and return fallback
+   - Graceful degradation: partial dependency failure does not crash the service
+5. **Organize tests by execution speed.** Tag or separate:
+   - Fast integration tests (in-memory mocks, < 5 seconds): run on every commit
+   - Slow integration tests (testcontainers, external services, > 5 seconds): run on PR
+### Phase 4: VALIDATE -- Execute and Verify Contract Compliance
+1. **Run the full integration test suite.** Execute all tests with verbose output. Collect:
+   - Pass/fail counts per test category (API, contract, repository)
+   - Execution time per test and per suite
+   - Any tests that require external services to be running
+2. **Verify contract compliance.** If using Pact:
+   - Publish consumer pacts to the broker
+   - Run provider verification against published pacts
+   - Confirm the can-i-deploy check passes for the target environment
+3. **Validate test isolation.** Run tests in random order (if the framework supports it). Any test that fails only when run after a specific other test has a shared-state bug. Fix immediately.
+4. **Run `harness validate`.** Confirm the project passes all harness checks including the new integration test infrastructure.
+5. **Generate coverage report.** Summarize:
+   - Endpoints tested vs. total endpoints discovered
+   - Contract coverage: consumer-provider pairs with verified contracts
+   - Error scenarios covered vs. identified
+   - Recommended next steps for remaining gaps
+### Graph Refresh
+If a knowledge graph exists at `.harness/graph/`, refresh it after code changes to keep graph queries accurate:
+```
+harness scan [path]
+```
+## Harness Integration
+- **`harness validate`** -- Run in VALIDATE phase after all integration tests are implemented. Confirms project-wide health.
+- **`harness check-deps`** -- Run after MOCK phase to verify test infrastructure dependencies do not leak into production bundles.
+- **`emit_interaction`** -- Used at checkpoints to present contract verification results and coverage gaps to the human.
+- **Grep** -- Used in DISCOVER phase to find route definitions, HTTP client usage, and service boundary patterns.
+- **Glob** -- Used to catalog existing integration tests and contract files.
+## Success Criteria
+- Every API endpoint has at least one integration test covering the happy path
+- Every consumer-provider boundary has a verified contract (when using microservices)
+- Error scenarios (400, 401, 403, 404, 500, timeout) are tested for all public endpoints
+- All integration tests pass with test doubles -- no dependency on external staging environments for CI
+- Test isolation is verified: tests pass in any execution order
+- `harness validate` passes with the integration test suite in place
+## Examples
+### Example: Express API with Supertest and Testcontainers
+**DISCOVER output:**
+```
+Framework: Express 4.18 with TypeScript
+Database: PostgreSQL via Prisma
+Endpoints: 14 routes across 4 controllers (users, projects, tasks, auth)
+Existing tests: 3 integration tests in tests/integration/ (auth only)
+Coverage gaps: projects CRUD, tasks filtering, user profile update
+```
+**IMPLEMENT -- API endpoint test with supertest:**
+```typescript
+// tests/integration/projects.test.ts
+import request from 'supertest';
+import { app } from '../../src/app';
+import { prisma } from '../../src/db';
+import { createTestUser, generateAuthToken } from '../helpers/auth';
+describe('POST /api/projects', () => {
+  let authToken: string;
+  beforeAll(async () => {
+    const user = await createTestUser(prisma);
+    authToken = generateAuthToken(user.id);
+  });
+  afterAll(async () => {
+    await prisma.project.deleteMany();
+    await prisma.user.deleteMany();
+  });
+  it('creates a project with valid data', async () => {
+    const response = await request(app)
+      .post('/api/projects')
+      .set('Authorization', `Bearer ${authToken}`)
+      .send({ name: 'Test Project', description: 'Integration test' });
+    expect(response.status).toBe(201);
+    expect(response.body).toMatchObject({
+      name: 'Test Project',
+      description: 'Integration test',
+    });
+    expect(response.body.id).toBeDefined();
+  });
+  it('returns 400 when name is missing', async () => {
+    const response = await request(app)
+      .post('/api/projects')
+      .set('Authorization', `Bearer ${authToken}`)
+      .send({ description: 'No name' });
+    expect(response.status).toBe(400);
+    expect(response.body.errors).toContainEqual(expect.objectContaining({ field: 'name' }));
+  });
+  it('returns 401 without auth token', async () => {
+    const response = await request(app).post('/api/projects').send({ name: 'Unauthorized' });
+    expect(response.status).toBe(401);
+  });
+});
+```
+### Example: Pact Consumer-Driven Contract Test
+**IMPLEMENT -- Consumer side (frontend):**
+```typescript
+// contract-tests/consumer/project-service.pact.ts
+import { PactV3, MatchersV3 } from '@pact-foundation/pact';
+import { ProjectClient } from '../../src/clients/project-client';
+const { like, eachLike, uuid } = MatchersV3;
+const provider = new PactV3({
+  consumer: 'Dashboard',
+  provider: 'ProjectService',
+});
+describe('ProjectService contract', () => {
+  it('returns a list of projects', async () => {
+    await provider
+      .given('projects exist for user')
+      .uponReceiving('a request for user projects')
+      .withRequest({
+        method: 'GET',
+        path: '/api/projects',
+        headers: { Authorization: like('Bearer token-123') },
+      })
+      .willRespondWith({
+        status: 200,
+        body: eachLike({
+          id: uuid(),
+          name: like('Project Alpha'),
+          createdAt: like('2026-01-15T10:30:00Z'),
+        }),
+      })
+      .executeTest(async (mockServer) => {
+        const client = new ProjectClient(mockServer.url);
+        const projects = await client.listProjects('token-123');
+        expect(projects).toHaveLength(1);
+        expect(projects[0].name).toBe('Project Alpha');
+      });
+  });
+});
+```
+## Gates
+- **No integration tests that require external staging environments for CI.** Every integration test must run with local test doubles (mocks, containers, in-memory databases). Tests that fail without a staging VPN are not integration tests -- they are environment tests.
+- **No shared mutable state between tests.** Each test must set up and tear down its own data. If tests fail when run in random order, shared state exists. Fix it before proceeding.
+- **No testing implementation details.** Integration tests assert on API contracts (status codes, response shapes, headers) and observable data changes -- not on internal function calls or database column values that are not part of the public contract.
+- **Contract changes must be coordinated.** If a provider contract test reveals a breaking change, do not silently update the consumer expectation. Flag it as a coordination point between teams.
+## Escalation
+- **When a service dependency has no API documentation or schema:** Cannot write accurate contract tests without knowing the contract. Escalate to the dependency team to provide an OpenAPI spec, proto file, or at minimum a Pact broker with published contracts.
+- **When Testcontainers fails in CI (Docker-in-Docker issues, resource limits):** Fall back to in-memory alternatives where possible. For databases that have no in-memory mode, escalate to DevOps to configure CI runners with Docker support.
+- **When contract verification fails on the provider side:** This indicates a real incompatibility between consumer expectations and provider implementation. Do not adjust the consumer test to match the provider bug. Escalate to the provider team with the failing interaction details.
+- **When integration tests exceed 5 minutes for a single service:** Triage by separating fast tests (mocked dependencies) from slow tests (testcontainers). Run fast tests on every commit, slow tests on PR only.
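
The coverage-gap step in Phase 1 of the SKILL.md above (cross-referencing discovered endpoints against existing tests) and the "endpoints tested vs. total endpoints discovered" summary in Phase 4 could be sketched roughly as follows. This is only an illustration: the `Endpoint` type and the `findCoverageGaps` and `coverageSummary` helpers are hypothetical names and are not part of the package.

```typescript
// Hypothetical shape for a discovered or tested endpoint (not from the package).
type Endpoint = { method: string; path: string };

// Normalize an endpoint to a comparable key such as "POST /api/projects".
const endpointKey = (e: Endpoint): string => `${e.method.toUpperCase()} ${e.path}`;

// Return the discovered endpoints that no existing test exercises.
function findCoverageGaps(discovered: Endpoint[], tested: Endpoint[]): Endpoint[] {
  const covered = new Set(tested.map(endpointKey));
  return discovered.filter((e) => !covered.has(endpointKey(e)));
}

// Summarize "endpoints tested vs. total endpoints discovered".
function coverageSummary(discovered: Endpoint[], tested: Endpoint[]): string {
  const untested = findCoverageGaps(discovered, tested).length;
  return `${discovered.length - untested}/${discovered.length} endpoints tested`;
}

// Illustrative data only.
const discoveredEndpoints: Endpoint[] = [
  { method: 'POST', path: '/api/projects' },
  { method: 'GET', path: '/api/projects' },
  { method: 'POST', path: '/api/auth/login' },
];
const testedEndpoints: Endpoint[] = [{ method: 'post', path: '/api/auth/login' }];

console.log(coverageSummary(discoveredEndpoints, testedEndpoints));
console.log(findCoverageGaps(discoveredEndpoints, testedEndpoints).map(endpointKey));
```

Comparing on a normalized `METHOD path` key keeps the check independent of how routes and tests spell the HTTP method. Matching parameterized routes (for example `/api/projects/:id`) would need extra handling that this sketch leaves out.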

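Phase 2 of the same SKILL.md asks for "factory functions for building test entities with sensible defaults". A minimal sketch of that pattern is shown here, assuming a `Project` entity loosely modeled on the fields used in the supertest example. The `ownerId` field, the default values, and the `buildProject` name are assumptions made for illustration.

```typescript
// Hypothetical entity shape; only name and description appear in the diffed example.
interface Project {
  id: string;
  name: string;
  description: string;
  ownerId: string;
}

// Counter used only to keep generated ids and names unique within a test run.
let projectSequence = 0;

// Build a Project with defaults; a test overrides only the fields it cares about.
function buildProject(overrides: Partial<Project> = {}): Project {
  projectSequence += 1;
  return {
    id: `project-${projectSequence}`,
    name: `Test Project ${projectSequence}`,
    description: 'Created by test factory',
    ownerId: 'user-1',
    ...overrides, // caller-supplied fields win over the defaults
  };
}

const defaultProject = buildProject();
const namedProject = buildProject({ name: 'Custom name' });
console.log(defaultProject, namedProject);
```

Because each call returns a fresh object, tests can build the data they need inside their own setup, which fits the "no shared mutable state between tests" gate above.
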
package/dist/agents/skills/gemini-cli/harness-integration-test/skill.yaml ADDED

@@ -0,0 +1,73 @@
+name: harness-integration-test
+version: "1.0.0"
+description: Service boundary testing, API integration testing, and consumer-driven contract validation
+cognitive_mode: meticulous-verifier
+triggers:
+  - manual
+  - on_new_feature
+  - on_pr
+platforms:
+  - claude-code
+  - gemini-cli
+tools:
+  - Bash
+  - Read
+  - Write
+  - Edit
+  - Glob
+  - Grep
+  - emit_interaction
+cli:
+  command: harness skill run harness-integration-test
+  args:
+    - name: path
+      description: Project root path
+      required: false
+    - name: scope
+      description: "Test scope: api, contract, or full. Defaults to full."
+      required: false
+    - name: service
+      description: "Target service name for focused testing. Tests all services when omitted."
+      required: false
+mcp:
+  tool: run_skill
+  input:
+    skill: harness-integration-test
+    path: string
+type: rigid
+tier: 3
+internal: false
+keywords:
+  - integration test
+  - contract test
+  - Pact
+  - service boundary
+  - API test
+  - mock service
+  - test container
+  - supertest
+  - httptest
+  - consumer-driven
+stack_signals:
+  - "tests/integration/"
+  - "src/**/__integration__/"
+  - "pact/"
+  - "contract-tests/"
+  - "tests/api/"
+phases:
+  - name: discover
+    description: Map service boundaries, API endpoints, and inter-service dependencies
+    required: true
+  - name: mock
+    description: Configure mock services, test containers, and contract stubs
+    required: true
+  - name: implement
+    description: Write integration tests covering API contracts, error scenarios, and boundary conditions
+    required: true
+  - name: validate
+    description: Execute tests against real or containerized services and verify contract compliance
+    required: true
+state:
+  persistent: false
+  files: []
+depends_on: []
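
For readers who want to reason about the manifest above in code, the fields visible in this skill.yaml could be mirrored by a TypeScript type along these lines. The type is inferred only from what this diff shows. The schema the CLI actually enforces is not part of this diff, and the `manifestProblems` checks are hypothetical sanity checks, not the package's validation rules.

```typescript
// Assumed shapes, inferred from the fields visible in the skill.yaml diff.
interface SkillPhase {
  name: string;
  description: string;
  required: boolean;
}

interface SkillManifest {
  name: string;
  version: string;
  type: string;
  tier?: number; // absent in some manifests; added to harness-integrity in this release
  internal?: boolean;
  triggers: string[];
  platforms: string[];
  tools: string[];
  phases: SkillPhase[];
  depends_on: string[];
}

// Hypothetical sanity checks; returns a list of human-readable problems.
function manifestProblems(m: SkillManifest): string[] {
  const problems: string[] = [];
  if (!/^\d+\.\d+\.\d+$/.test(m.version)) problems.push('version is not MAJOR.MINOR.PATCH');
  if (m.platforms.length === 0) problems.push('no platforms declared');
  if (m.phases.length === 0) problems.push('no phases declared');
  const phaseNames = m.phases.map((p) => p.name);
  if (new Set(phaseNames).size !== phaseNames.length) problems.push('duplicate phase names');
  return problems;
}

// Values copied from the manifest shown in the diff (descriptions shortened).
const integrationTestManifest: SkillManifest = {
  name: 'harness-integration-test',
  version: '1.0.0',
  type: 'rigid',
  tier: 3,
  internal: false,
  triggers: ['manual', 'on_new_feature', 'on_pr'],
  platforms: ['claude-code', 'gemini-cli'],
  tools: ['Bash', 'Read', 'Write', 'Edit', 'Glob', 'Grep', 'emit_interaction'],
  phases: ['discover', 'mock', 'implement', 'validate'].map((name) => ({
    name,
    description: `${name} phase`,
    required: true,
  })),
  depends_on: [],
};

console.log(manifestProblems(integrationTestManifest));
```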

package/dist/agents/skills/gemini-cli/harness-integrity/skill.yaml CHANGED

@@ -28,6 +28,7 @@ mcp:
     skill: harness-integrity
     path: string
 type: rigid
+tier: 2
 cognitive_mode: meticulous-verifier
 phases:
   - name: verify

package/dist/agents/skills/gemini-cli/harness-knowledge-mapper/skill.yaml CHANGED

@@ -30,6 +30,7 @@ mcp:
     skill: harness-knowledge-mapper
     path: string
 type: rigid
+internal: true
 phases:
   - name: survey
     description: Query graph for module structure and topology

package/dist/agents/skills/gemini-cli/harness-load-testing/SKILL.md ADDED

@@ -0,0 +1,274 @@
+# Harness Load Testing
+> Stress testing, capacity planning, and performance benchmarking with k6, Artillery, and Gatling. Detects existing load test infrastructure, designs test scenarios for critical paths, executes tests, and analyzes results against defined thresholds.
+## When to Use
+- Before major releases or migrations to validate capacity and identify breaking points
+- At milestone boundaries to establish performance baselines for critical endpoints
+- When scaling infrastructure and needing to verify new capacity meets demand projections
+- NOT for unit-level microbenchmarks (use harness-perf for function-level performance)
+- NOT for real-time production monitoring (use observability tooling like Datadog or Grafana)
+- NOT for frontend performance audits (use Lighthouse or harness-perf for client-side metrics)
+## Process
+### Phase 1: DETECT -- Inventory Existing Tests and Infrastructure
+1. **Discover load testing tooling.** Scan the project for load test infrastructure:
+   - k6: `*.k6.js`, `k6/` directory, `import { check } from 'k6'` patterns
+   - Artillery: `artillery.yml`, `artillery/` directory, `config.target` in YAML files
+   - Gatling: `gatling/`, `*Simulation.scala`, `gatling.conf`
+   - JMeter: `*.jmx` files, `jmeter/` directory
+   - Custom: `load-tests/`, `perf/`, `benchmark/` directories
+2. **Inventory existing test scenarios.** For each discovered test file:
+   - Extract target endpoints and their HTTP methods
+   - Identify virtual user counts, ramp-up profiles, and duration
+   - Note defined thresholds (p95, p99, error rate, throughput)
+   - Check for test data generation or seed data requirements
+3. **Map critical endpoints.** Identify endpoints that should be load tested:
+   - High-traffic endpoints from route definitions and API documentation
+   - Endpoints involved in revenue-critical flows (checkout, payment, subscription)
+   - Endpoints with known performance sensitivity (search, aggregation, file upload)
+   - Recently changed endpoints from `git log --oneline --since="30 days ago"`
+4. **Identify coverage gaps.** Compare critical endpoints against existing test scenarios:
+   - Endpoints with no load test coverage
+   - Tests with outdated thresholds or missing scenarios (spike, soak)
+   - Tests that do not match current API contracts (changed parameters, new endpoints)
+5. **Check test infrastructure.** Verify the testing environment:
+   - Target environment configuration (staging URL, test database, mock services)
+   - CI/CD integration (is load testing part of the pipeline?)
+   - Test data availability (seed scripts, fixtures, data generators)
+---
+### Phase 2: DESIGN -- Define Scenarios and Thresholds
+1. **Select test profiles.** Design scenarios for each critical endpoint based on the testing goal:
+   - **Smoke test:** 1-5 VUs for 1 minute. Validates the test script works correctly.
+   - **Load test:** Expected production traffic for 5-15 minutes. Validates normal operation.
+   - **Stress test:** 2-3x expected traffic with ramp-up. Finds the breaking point.
+   - **Spike test:** Sudden burst from baseline to 10x traffic. Tests auto-scaling and recovery.
+   - **Soak test:** Expected traffic for 1-4 hours. Detects memory leaks and connection pool exhaustion.
+2. **Define virtual user profiles.** Model realistic user behavior:
+   - Think time between requests (sleep 1-3 seconds between page interactions)
+   - Request sequences that mirror real user journeys (browse -> search -> add to cart -> checkout)
+   - Authentication flows (login, token refresh, session maintenance)
+   - Data variation (different product IDs, user accounts, search queries)
+3. **Set performance thresholds.** Define pass/fail criteria per endpoint:
+   - **Latency:** p95 < 500ms, p99 < 1000ms (adjust per endpoint SLO)
+   - **Error rate:** < 1% for load tests, < 5% for stress tests
+   - **Throughput:** Minimum requests per second for the target VU count
+   - Thresholds should be derived from SLOs if they exist, or from baseline measurements
+4. **Generate test scripts.** Produce test files in the detected tool format:
+   - k6: JavaScript test with `stages`, `thresholds`, and `checks`
+   - Artillery: YAML config with `phases`, `ensure`, and scenario `flow`
+   - Include parameterized data, proper assertions, and tagged metrics
+5. **Design ramp-up stages.** For each profile, define the VU ramp schedule:
+   - Gradual ramp-up to avoid thundering herd at test start
+   - Plateau at target load for measurement stability
+   - Optional step-up stages for stress tests to find the exact breaking point
+   - Cool-down period to observe recovery behavior
+---
+### Phase 3: EXECUTE -- Run Tests and Collect Metrics
+1. **Validate test environment.** Before executing:
+   - Confirm the target URL is a staging/test environment (never run load tests against production without explicit approval)
+   - Verify test database is seeded and isolated
+   - Check that external dependencies are stubbed or rate-limited appropriately
+   - Confirm monitoring is active on the target environment to correlate metrics
+2. **Run smoke test first.** Execute each test script with minimal load:
+   - 1-2 VUs for 30 seconds
+   - Verify all requests return expected status codes
+   - Confirm test data and authentication work correctly
+   - If smoke test fails, fix the script before proceeding to load tests
+3. **Execute the selected test profile.** Run the load test and capture:
+   - Request latency distribution (p50, p90, p95, p99, max)
+   - Throughput (requests per second over time)
+   - Error rate and error type distribution (4xx vs 5xx)
+   - Virtual user count over time
+   - If using k6: `k6 run --out json=results.json test.k6.js`
+   - If using Artillery: `artillery run --output report.json artillery.yml`
+4. **Monitor system resources during execution.** If accessible:
+   - CPU and memory utilization on target servers
+   - Database connection pool usage and query times
+   - Queue depths and consumer lag
+   - Network I/O and connection counts
+5. **Capture baseline or compare to previous run.** Store results for trend analysis:
+   - Save raw results to `load-tests/results/YYYY-MM-DD-<profile>.json`
+   - If a previous baseline exists, prepare a comparison dataset
+   - Tag results with git commit SHA for traceability
+---
+### Phase 4: ANALYZE -- Interpret Results and Report
+1. **Evaluate against thresholds.** For each endpoint tested:
+   - Did p95 latency meet the threshold? If not, by how much?
+   - Did error rate stay within acceptable bounds?
+   - Did throughput meet the minimum target?
+   - At what VU count did performance degrade (stress test breaking point)?
+2. **Identify bottlenecks.** Correlate performance data with system metrics:
+   - High latency + high CPU = compute-bound (optimize algorithm or scale horizontally)
+   - High latency + normal CPU = I/O-bound (database queries, external API calls, disk)
+   - Increasing errors at specific VU count = resource exhaustion (connection pools, file descriptors, memory)
+   - Latency spike at ramp-up = cold start issue (JIT compilation, cache warming, connection establishment)
+3. **Calculate capacity projections.** Based on stress test results:
+   - Maximum sustainable throughput before p95 exceeds threshold
+   - Current headroom as percentage of maximum capacity
+   - Projected capacity needed for growth targets (e.g., 2x traffic in 6 months)
+   - Scaling recommendation: vertical (bigger instances) vs horizontal (more instances)
+4. **Compare to baseline.** If previous results exist:
+   - Latency delta per endpoint (regression or improvement)
+   - Throughput delta (higher or lower capacity)
+   - Error rate delta
+   - Flag any regressions greater than 10% as requiring investigation
+5. **Produce the load test report.** Output a structured summary:
+   ```
+   Load Test Report: <profile> — <date>
+   Target: <URL> | Tool: <k6/Artillery/Gatling> | Duration: <time>
+   Commit: <SHA>
+   Results:
+     Endpoint          | p50    | p95    | p99    | Errors | RPS   | Status
+     GET /api/products | 45ms   | 120ms  | 340ms  | 0.1%   | 850   | PASS
+     POST /api/orders  | 180ms  | 520ms  | 1100ms | 0.8%   | 120   | WARN (p99 > 1000ms)
+     GET /api/search   | 95ms   | 680ms  | 2100ms | 2.3%   | 340   | FAIL (p99 > 1000ms, errors > 1%)
+   Capacity: Max sustainable 1200 RPS at p95 < 500ms (current production: ~400 RPS, 3x headroom)
+   Bottleneck: /api/search becomes I/O-bound at 500 RPS — database full-text search query
+   Recommendation: Add search index or migrate to Elasticsearch for /api/search
+   ```
+6. **Archive results.** Save the full report and raw data for historical comparison.
+---
+## Harness Integration
+- **`harness skill run harness-load-testing`** -- Primary CLI entry point. Runs all four phases.
+- **`harness validate`** -- Run after generating test scripts to verify project structure.
+- **`harness check-deps`** -- Verify load testing tool dependencies are declared (k6, artillery npm package).
+- **`emit_interaction`** -- Used before execution (checkpoint:human-verify) to confirm target environment and test profile.
+- **`Glob`** -- Discover existing load test files, result archives, and configuration.
+- **`Grep`** -- Search for endpoint definitions, route handlers, and threshold configurations.
+- **`Write`** -- Generate load test scripts and result reports.
+- **`Edit`** -- Update existing test scripts with new scenarios or adjusted thresholds.
+## Success Criteria
+- All critical endpoints are identified and have corresponding load test scenarios
+- Test scripts are syntactically valid and pass a smoke test before full execution
+- Thresholds are derived from SLOs or documented baseline measurements, not arbitrary values
+- Results include latency percentiles (p50, p95, p99), error rates, and throughput
+- Bottlenecks are identified with specific causes (CPU, I/O, connection pool, memory)
+- Capacity projections include headroom percentage and scaling recommendations
+## Examples
+### Example: k6 Load Test for Express.js REST API
+```
+Phase 1: DETECT
+  Tool: k6 (found k6/ directory with 2 existing scripts)
+  Existing tests:
+    - k6/smoke-api.k6.js: GET /api/health (1 VU, 10s)
+    - k6/load-products.k6.js: GET /api/products (50 VUs, 5m, p95 < 300ms)
+  Coverage gaps:
+    - POST /api/orders — revenue-critical, no load test
+    - GET /api/search — high-traffic, no load test
+    - No stress or soak test profiles exist
+Phase 2: DESIGN
+  New scenarios:
+    - k6/load-orders.k6.js: POST /api/orders
+      Stages: ramp to 100 VUs over 2m, hold 5m, ramp down 1m
+      Thresholds: p95 < 800ms, errors < 1%, RPS > 80
+    - k6/stress-api.k6.js: All endpoints
+      Stages: step-up from 50 to 500 VUs in 50-VU increments, 2m per step
+      Thresholds: find breaking point, record p95 at each step
+    - k6/soak-api.k6.js: Critical endpoints at expected load
+      Duration: 2 hours at 200 VUs
+      Thresholds: p95 < 500ms, memory growth < 50MB/hour
+Phase 3: EXECUTE
+  Environment: https://staging.example.com (verified non-production)
+  Smoke: All scripts pass with 1 VU
+  Load test results captured to load-tests/results/2026-03-27-load.json
+Phase 4: ANALYZE
+  Results: POST /api/orders p95=620ms (PASS), GET /api/search p99=2100ms (FAIL)
+  Bottleneck: Full-text search on PostgreSQL LIKE query at 300+ RPS
+  Capacity: 800 RPS sustainable, current production 250 RPS (3.2x headroom)
+  Recommendation: Add pg_trgm index or migrate search to Elasticsearch
+```
+### Example: Artillery Test for NestJS GraphQL API
+```
+Phase 1: DETECT
+  Tool: Artillery (found artillery.yml with 1 scenario)
+  Existing: query { products } at 20 RPS for 60s
+  Gaps: mutations not tested, no spike profile, thresholds not defined
+Phase 2: DESIGN
+  New config: artillery/graphql-load.yml
+    Phases:
+      - warm-up: 5 RPS for 30s
+      - load: 50 RPS for 5m
+      - spike: jump to 200 RPS for 30s, back to 50 RPS
+    Scenarios:
+      - query { products(limit: 20) } — 60% weight
+      - mutation { createOrder(input: $input) } — 25% weight
+      - query { user(id: $id) { orders } } — 15% weight
+    Ensure:
+      - p99 < 1500ms
+      - maxErrorRate < 2
+Phase 3: EXECUTE
+  Target: http://staging.internal:3000/graphql
+  Smoke: PASS (all queries resolve, auth tokens valid)
+  Full run: artillery run --output report.json artillery/graphql-load.yml
+Phase 4: ANALYZE
+  Results:
+    query products: p95=89ms, p99=210ms — PASS
+    mutation createOrder: p95=340ms, p99=890ms — PASS
+    query user.orders: p95=520ms, p99=2400ms — FAIL
+  Bottleneck: N+1 query in user.orders resolver (no DataLoader)
+  Spike recovery: System recovered to baseline within 15s after spike — PASS
+  Recommendation: Add DataLoader for orders resolver, re-test after fix
+```
+## Gates
+- **No load tests against production without explicit human approval.** Load tests can cause real outages. The target environment must be verified as non-production before execution. If production testing is required, a `[checkpoint:human-verify]` must be passed with documented approval.
+- **No test execution with failing smoke tests.** If the smoke test fails, the test script is broken. Fix the script before running at load. Running broken scripts at scale wastes time and produces meaningless results.
+- **No capacity claims without stress test data.** Capacity projections require finding the actual breaking point. A load test that passes at the expected level does not establish maximum capacity. Stress test data is required for credible projections.
+- **No threshold changes without documented justification.** If thresholds are relaxed from a previous baseline, the reason must be documented (e.g., "p95 threshold increased from 300ms to 500ms due to added encryption overhead per SEC-2026-04 requirement").
+## Escalation
+- **When the staging environment does not match production configuration:** Load test results against a differently-sized environment are misleading. Report: "Staging has [N] instances with [M] CPU/RAM. Production has [X] instances with [Y] CPU/RAM. Results should be scaled by a factor of [ratio], but this is approximate. Recommend matching staging to production configuration for accurate capacity planning."
+- **When external dependencies cannot be stubbed for load testing:** Real external APIs may rate-limit or block load test traffic. Report: "The [dependency] cannot be stubbed and will rate-limit at [N] RPS. Options: (1) use a mock service like WireMock, (2) test with a dedicated sandbox API key, (3) test only the internal service layer with the external call short-circuited."
+- **When load test results are inconsistent between runs:** Noisy results indicate environmental interference. Report: "Variance between runs exceeds 15% for [endpoint]. Possible causes: shared staging environment, garbage collection pauses, network congestion. Recommend running 3 iterations and using the median, or running on a dedicated isolated environment."
+- **When the breaking point is below expected production traffic:** This is a capacity emergency. Report: "The system breaks at [N] RPS but production receives [M] RPS. Immediate scaling action required. Short-term: increase instance count by [factor]. Long-term: address the identified bottleneck in [component]."
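
The threshold evaluation in Phase 4 of the load-testing SKILL.md above (p95, p99, and error rate against pass/fail limits, plus capacity headroom) could be sketched as follows. This is a rough illustration under stated assumptions: it uses the nearest-rank percentile definition, which may differ from how k6 or Artillery compute percentiles, and the `Thresholds` type and every function name here are hypothetical.

```typescript
// Nearest-rank percentile over latency samples in milliseconds (assumed definition).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical threshold shape; the defaults in the SKILL.md are p95 < 500ms, p99 < 1000ms, errors < 1%.
interface Thresholds {
  p95Ms: number;
  p99Ms: number;
  maxErrorRate: number; // fraction, e.g. 0.01 for 1%
}

interface EndpointResult {
  p95: number;
  p99: number;
  errorRate: number;
  status: 'PASS' | 'FAIL';
}

// Evaluate one endpoint: every threshold must hold for a PASS.
function evaluateEndpoint(latenciesMs: number[], errorCount: number, t: Thresholds): EndpointResult {
  const p95 = percentile(latenciesMs, 95);
  const p99 = percentile(latenciesMs, 99);
  const errorRate = errorCount / latenciesMs.length;
  const pass = p95 < t.p95Ms && p99 < t.p99Ms && errorRate < t.maxErrorRate;
  return { p95, p99, errorRate, status: pass ? 'PASS' : 'FAIL' };
}

// Headroom as a multiple of current traffic, e.g. 1200 RPS max vs 400 RPS current.
function headroomMultiple(maxSustainableRps: number, currentRps: number): number {
  return maxSustainableRps / currentRps;
}

// Illustrative data: 100 requests with latencies of 1..100 ms and no errors.
const sampleLatencies = Array.from({ length: 100 }, (_, i) => i + 1);
const defaultThresholds: Thresholds = { p95Ms: 500, p99Ms: 1000, maxErrorRate: 0.01 };

console.log(evaluateEndpoint(sampleLatencies, 0, defaultThresholds));
console.log(headroomMultiple(1200, 400));
```

The headroom call uses the figures from the sample report in the SKILL.md (1200 RPS sustainable against roughly 400 RPS in production), which is where the "3x headroom" in that report comes from.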