npm - @allenpan2026/harshjudge - Versions diffs - 0.4.4 → 0.4.5 - Mend

@allenpan2026/harshjudge 0.4.4 → 0.4.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10) hide show

package/.claude-plugin/plugin.json +1 -1
package/package.json +1 -1
package/skills/harshjudge/SKILL.md +26 -27
package/skills/harshjudge/assets/prd.md +9 -10
package/skills/harshjudge/references/create.md +32 -2
package/skills/harshjudge/references/iterate.md +19 -12
package/skills/harshjudge/references/run-step-agent.md +25 -13
package/skills/harshjudge/references/run-tools.md +95 -0
package/skills/harshjudge/references/run.md +2 -2
package/skills/harshjudge/references/run-browser.md +0 -63

package/.claude-plugin/plugin.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "harshjudge",
-  "version": "0.4.4",
+  "version": "0.4.5",
   "description": "AI-native E2E testing orchestration for Claude Code",
   "author": {
     "name": "Allen Pan"

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@allenpan2026/harshjudge",
-  "version": "0.4.4",
+  "version": "0.4.5",
   "description": "AI-native E2E testing orchestration CLI for Claude Code.",
   "type": "module",
   "main": "./dist/index.js",

package/skills/harshjudge/SKILL.md CHANGED Viewed

@@ -1,11 +1,11 @@
 ---
 name: harshjudge
-description: AI-native E2E testing orchestration for Claude Code. Use when creating, running, or managing end-to-end test scenarios with visual evidence capture. Activates for tasks involving E2E tests, browser automation testing, test scenario creation, test execution with screenshots, or checking test status.
+description: E2E testing orchestration for Claude Code. Use when creating, running, or managing end-to-end test scenarios — frontend (browser), backend (API), or CLI. Activates for tasks involving E2E tests, test scenario creation, test execution with evidence capture, or checking test status.
 ---
 # HarshJudge E2E Testing
-AI-native E2E testing with CLI commands and visual evidence capture.
+AI-native E2E testing with CLI commands and evidence capture.
 ## CLI Setup
@@ -20,7 +20,7 @@ alias harshjudge="npx @allenpan2026/harshjudge@latest"
 ## Core Principles
-1. **Evidence First**: Screenshot before and after every action
+1. **Evidence First**: Capture evidence appropriate to the step type — screenshots for frontend, response bodies for API, stdout for CLI
 2. **Fail Fast**: Stop on error, report with context
 3. **Complete Runs**: Always call `harshjudge complete-run`, even on failure
 4. **Step Isolation**: Each step executes in its own spawned agent for token efficiency
@@ -108,27 +108,19 @@ Main Agent                    Step Agents (spawned per step)
 | `harshjudge discover search <pattern>` | Search file content |
 | `harshjudge dashboard open/close/status` | Manage dashboard server |
-### Browser Automation (Auto-Detect)
+### Step Types
-Before running a scenario, detect which browser tool is available by checking for these tools in order:
+Each step declares its execution mode via `type` in the step file frontmatter:
-1. **Playwright MCP** — look for `browser_navigate` tool
-2. **browser-use** — look for `browser_use` tool or `browser-use` CLI
-3. **cmux-browser** — look for cmux browser surfaces
+| Type | Tools | Evidence Captured |
+|------|-------|-------------------|
+| `frontend` | Browser tool (auto-detected) | screenshot, console_log, network_log, html_snapshot |
+| `backend` | Bash (curl/httpie) | api_response, api_headers, db_snapshot |
+| `cli` | Bash | stdout, stderr, exit_code |
-Use whichever is found first. The step agent needs these actions:
+If `type` is omitted, the step agent infers from the step content.
-| Action | What to do |
-|--------|-----------|
-| Navigate | Go to a URL |
-| Inspect | Get page state before interacting |
-| Click | Click element by text/role/ref |
-| Type | Enter text into input |
-| Screenshot | Capture page as image file |
-| Wait | Wait for text/element/timeout |
-| Console | Read browser console output |
-See [run-browser.md](references/run-browser.md) for tool-specific syntax.
+See [run-tools.md](references/run-tools.md) for tool-specific guidance per type.
 ## Step Agent Prompt Template
@@ -140,24 +132,31 @@ Execute step {stepId} of scenario {scenarioSlug}:
 ## Step Content
 {content from steps/{stepId}-{slug}.md}
+## Step Type
+{type from step frontmatter, or infer from content: frontend|backend|cli}
 ## Project Context
 Base URL: {from config.yaml}
-Auth: {from prd.md if needed}
+Services: {from prd.md — list of services under test}
 ## Previous Step
 Status: {pass|fail|first step}
 ## Your Task
-1. Execute the actions using the available browser tool
-2. Inspect the page before clicking or typing
-3. Capture before/after screenshots
-4. Record evidence: harshjudge evidence <runId> --step {stepNumber} --type screenshot --name before --data /path/to/screenshot.png
+1. Read the step type from frontmatter (frontend/backend/cli)
+2. Execute the actions using the appropriate tool:
+   - frontend: use available browser tool
+   - backend: use curl/httpie via Bash
+   - cli: run commands via Bash
+3. Capture evidence appropriate to the step type
+4. Record evidence: harshjudge evidence <runId> --step {stepNumber} --type <evidence_type> --name <name> --data <path_or_data>
 Return ONLY a JSON object:
 {
   "status": "pass" | "fail",
-  "evidencePaths": ["path1.png", "path2.png"],
-  "error": null | "error message"
+  "evidencePaths": ["path1", "path2"],
+  "error": null | "error message",
+  "summary": "Brief description of what happened and result (1-2 sentences)"
 }
 DO NOT return full evidence content. DO NOT explain your work.

package/skills/harshjudge/assets/prd.md CHANGED Viewed

@@ -1,15 +1,14 @@
 # Project PRD
 ## Application Type
-<!-- backend | fullstack | frontend | other -->
+<!-- backend | fullstack | frontend | cli | other -->
 {app_type}
-## Ports
-| Service | Port |
-|---------|------|
-| Frontend | {frontend_port} |
-| Backend | {backend_port} |
-| Database | {database_port} |
+## Services Under Test
+| Service | Type | Endpoint/Command |
+|---------|------|-----------------|
+| {service_name} | frontend/backend/cli | {url or command} |
 ## Main Scenarios
 <!-- High-level list of main testing scenarios -->
@@ -26,9 +25,9 @@
 ## Tech Stack
 <!-- Frameworks, libraries, tools -->
-- Frontend: {frontend_stack}
-- Backend: {backend_stack}
-- Testing: {testing_tools}
+- {stack_item_1}
+- {stack_item_2}
+- {stack_item_3}
 ## Notes
 <!-- Additional context for test scenarios -->

package/skills/harshjudge/references/create.md CHANGED Viewed

@@ -28,7 +28,7 @@ Read .harshJudge/prd.md
 Check for:
 - Existing user flows to test
-- Known UI patterns and selectors
+- Known patterns, endpoints, and commands
 - Timing considerations
 - Environment requirements
 - Test credentials
@@ -52,6 +52,7 @@ Each step needs:
 ```typescript
 {
+  type: "frontend" | "backend" | "cli",  // Step execution mode (optional, inferred if omitted)
   title: string,           // Step title (becomes filename)
   description?: string,    // What this step does
   preconditions?: string,  // Required state before step
@@ -60,9 +61,10 @@ Each step needs:
 }
 ```
-**Example step:**
+**Example step (frontend):**
 ```json
 {
+  "type": "frontend",
   "title": "Navigate to login",
   "description": "Open the application login page",
   "preconditions": "Application is running at baseUrl",
@@ -71,6 +73,30 @@ Each step needs:
 }
 ```
+**Example step (backend):**
+```json
+{
+  "type": "backend",
+  "title": "Create user via API",
+  "description": "POST to /api/users and verify 201 response",
+  "preconditions": "Server is running at baseUrl",
+  "actions": "1. POST /api/users with {name: 'test', email: 'test@example.com'}\n2. Capture response body and status code",
+  "expectedOutcome": "Response status 201, body contains user id"
+}
+```
+**Example step (cli):**
+```json
+{
+  "type": "cli",
+  "title": "Generate config file",
+  "description": "Run the generate command and verify output",
+  "preconditions": "Tool is installed and on PATH",
+  "actions": "1. Run my-tool generate --config prod\n2. Capture stdout and exit code",
+  "expectedOutcome": "Exit code 0, stdout contains 'Generated successfully'"
+}
+```
 ### Step 4: Run create
 Pass scenario data as JSON via stdin or a file:
@@ -197,6 +223,10 @@ avgDuration: 0
 **Step file format (01-navigate-to-login.md):**
 ```markdown
+---
+type: frontend
+---
 # Step 01: Navigate to login
 ## Description

package/skills/harshjudge/references/iterate.md CHANGED Viewed

@@ -13,13 +13,12 @@ Use this workflow when:
 - `harshjudge status <slug>` — review failed run evidence
 - `harshjudge create <slug>` — update scenario with step files
 - `harshjudge start` + `harshjudge complete-step` + `harshjudge complete-run` — re-run test
-- Playwright tools for browser automation
 ## Core Philosophy: Learn from Failures
 **Failed runs are valuable data, not waste.** Each failed run provides:
-1. Screenshots showing what actually happened (in `step-XX/evidence/`)
-2. Logs revealing backend behavior
+1. Step evidence showing what actually happened (in `step-XX/evidence/`)
+2. Logs, responses, and output revealing actual behavior
 3. Evidence of gaps between expectation and reality
 **Goal:** Use evidence to iterate toward a scenario that accurately tests the intended behavior, and **accumulate learnings** in `prd.md`.
@@ -42,7 +41,7 @@ Navigate to the failed run's evidence directories:
 .harshJudge/scenarios/{slug}/runs/{runId}/
 ```
-Read `result.json` for per-step details. View screenshots in `step-XX/evidence/`.
+Read `result.json` for per-step details. Review step evidence (screenshots, responses, output) in `step-XX/evidence/`.
 ### Step 3: Review the Dashboard
@@ -52,14 +51,22 @@ harshjudge dashboard open
 Open `http://localhost:3001` → Scenario → Failed Run.
-Examine: before/after screenshots, console logs, network logs.
+Examine: step evidence (screenshots, responses, output), console logs, network logs.
 ### Step 4: Classify the Failure
 | Failure Type | Description | Action | Document In |
 |-------------|-------------|--------|-------------|
-| **Selector Broken** | UI changed, selectors outdated | Edit step file | prd.md (selector notes) |
-| **Timing Issue** | Action too fast, element not ready | Add wait to step | prd.md (timing patterns) |
+| **Frontend: Element not found** | UI changed, element missing or relocated | Edit step file with updated actions | prd.md (UI patterns) |
+| **Frontend: Page didn't load** | Navigation failed or timed out | Add wait, check URL | prd.md (timing patterns) |
+| **Frontend: Visual mismatch** | Page state differs from expectation | Update expected outcome | — |
+| **Backend: Status code mismatch** | API returned unexpected status | Update step or fix app | prd.md (known behaviors) |
+| **Backend: Response schema drift** | Response shape changed | Update expected outcome | prd.md (schema notes) |
+| **Backend: Timeout** | Request took too long | Add timeout, check service | prd.md (env setup) |
+| **CLI: Non-zero exit code** | Command failed unexpectedly | Check stderr, update step | prd.md (known errors) |
+| **CLI: Missing output** | Expected text absent from stdout | Update expected outcome | — |
+| **CLI: Unexpected stderr** | Warnings or errors in stderr | Investigate root cause | prd.md (known bugs) |
+| **Timing Issue** | Action too fast, resource not ready | Add wait to step | prd.md (timing patterns) |
 | **Step Mismatch** | Step describes wrong flow | Edit step file | — |
 | **Missing Step** | Need additional step | Add step, update scenario | — |
 | **App Bug** | Application has actual bug | Mark as known-fail | prd.md (known bugs) |
@@ -102,10 +109,10 @@ Follow [[run]] workflow:
 **Root Cause:** Email input selector changed from `.email-input` to `[data-testid="email"]`
 **Changes Made:**
-- Updated step-02 Playwright selectors to use data-testid attributes
+- Updated step-02 actions to match the new API response schema
 **Learning:**
-- Always prefer data-testid selectors over class names
+- Always verify response schema against live API, not just status code
 ```
 ### Step 8: Report Iteration Result
@@ -117,17 +124,17 @@ Previous Run: {runId} (FAIL at step 02)
 New Run: {newRunId} (PASS)
 Changes:
-- Updated step-02 selectors to use data-testid
+- Updated step-02 expected outcome to match new API response schema
 Learnings recorded in prd.md:
-- Selector convention: prefer data-testid attributes
+- API response schema: always check body structure, not just status code
 ```
 ---
 ## Best Practices
-1. **Review step evidence first** — before changing anything, examine before/after screenshots
+1. **Review step evidence first** — before changing anything, examine step evidence (screenshots, responses, output)
 2. **Edit individual steps when possible** — for small fixes, edit the `.md` file directly
 3. **Use create for major changes** — when adding/removing steps or reorganizing
 4. **Document learnings in prd.md** — after each successful iteration

package/skills/harshjudge/references/run-step-agent.md CHANGED Viewed

@@ -10,27 +10,39 @@ Execute step {stepId} of scenario {scenarioSlug}:
 ## Step Content
 {paste content from steps/{step.file}}
+## Step Type
+{type from step frontmatter, or infer from content: frontend|backend|cli}
 ## Project Context
 Base URL: {from config.yaml}
-Auth: {from prd.md if this step involves login}
+Services: {from prd.md — list of services under test}
 ## Previous Step
 Status: {pass|fail|first step}
 ## Your Task
-1. Navigate to the base URL if not already there
-2. Execute the actions described in the step content
-3. Use the available browser tool to inspect the page before interacting
-4. Take before/after screenshots using the browser tool
-5. Record evidence:
-   harshjudge evidence {runId} --step {stepNumber} --type screenshot --name before --data /path/to/screenshot.png
-6. Verify the expected outcome
-7. Write a summary describing what happened and whether expected outcome matched
-Return ONLY a JSON object:
+Based on step type:
+**frontend:**
+1. Use the available browser tool to navigate and interact
+2. Inspect the page before clicking or typing
+3. Take before/after screenshots
+4. Record evidence: harshjudge evidence {runId} --step {stepNumber} --type screenshot --name before --data /path/to/screenshot.png
+**backend:**
+1. Execute HTTP requests using curl or httpie via Bash
+2. Capture the full response (status, headers, body)
+3. Record evidence: harshjudge evidence {runId} --step {stepNumber} --type api_response --name response --data /path/to/response.json
+**cli:**
+1. Run the specified commands via Bash
+2. Capture stdout and stderr
+3. Record evidence: harshjudge evidence {runId} --step {stepNumber} --type stdout --name output --data /path/to/output.txt
+Then verify the expected outcome and return ONLY a JSON object:
 {
   "status": "pass" | "fail",
-  "evidencePaths": ["path1.png", "path2.png"],
+  "evidencePaths": ["path1", "path2"],
   "error": null | "error message",
   "summary": "Brief description of what happened and result (1-2 sentences)"
 }
@@ -57,7 +69,7 @@ Task tool with:
   "status": "pass",
   "evidencePaths": [
     ".harshJudge/scenarios/login-flow/runs/abc123xyz/step-01/evidence/before.png",
-    ".harshJudge/scenarios/login-flow/runs/abc123xyz/step-01/evidence/after.png"
+    ".harshJudge/scenarios/login-flow/runs/abc123xyz/step-01/evidence/response.json"
   ],
   "error": null,
   "summary": "Login form loaded successfully. Email and password fields visible."

package/skills/harshjudge/references/run-tools.md ADDED Viewed

@@ -0,0 +1,95 @@
+# Tool Reference by Step Type
+Used during step execution in [[run]].
+HarshJudge supports three step types. Use the tools appropriate to the step type.
+## Frontend Steps
+Use whatever browser automation tool is available in your environment.
+### Required Capabilities
+| Action | What to do |
+|--------|-----------|
+| Navigate | Go to a URL |
+| Inspect | Get page state before interacting |
+| Click | Click element by text/role/ref |
+| Type | Enter text into input |
+| Screenshot | Capture page as image file |
+| Wait | Wait for text/element/timeout |
+| Console | Read browser console output |
+### Supported Browser Tools
+**Playwright MCP** (default):
+Tools: `browser_navigate`, `browser_click`, `browser_type`, `browser_snapshot`, `browser_take_screenshot`, `browser_wait_for`, `browser_console_messages`, `browser_network_requests`
+**browser-use MCP** (token efficient):
+See [browser-use MCP docs](https://docs.browser-use.com/customize/integrations/mcp-server)
+**Chrome DevTools MCP:**
+Tools: page navigation, DOM inspection, network monitoring via Chrome remote debugging
+### Best Practices
+- Inspect the page before clicking or typing
+- Take a screenshot **before** and **after** each significant action
+- Wait after navigation to confirm page loaded
+- Capture console errors on unexpected behavior
+## Backend Steps
+Use Bash to make HTTP requests and query databases.
+### HTTP Requests
+```bash
+# Using curl
+curl -s -w "\n%{http_code}" -H "Content-Type: application/json" \
+  -X POST http://localhost:3000/api/users \
+  -d '{"name": "test"}' > /tmp/response.json
+# Save response for evidence
+harshjudge evidence <runId> --step 1 --type api_response --name create-user --data /tmp/response.json
+```
+### Database Queries
+```bash
+# PostgreSQL example
+psql -h localhost -U user -d mydb -c "SELECT * FROM users WHERE email='test@example.com'" \
+  --csv > /tmp/db-result.csv
+harshjudge evidence <runId> --step 1 --type db_snapshot --name users-check --data /tmp/db-result.csv
+```
+### Best Practices
+- Always capture the full response (status code + headers + body)
+- Save responses to temp files, then record via `harshjudge evidence`
+- For auth flows, chain requests (login → use token → verify)
+- Check response schema, not just status code
+## CLI Steps
+Use Bash to run commands and capture output.
+### Command Execution
+```bash
+# Run command and capture output
+my-tool generate --config prod > /tmp/stdout.txt 2> /tmp/stderr.txt
+echo $? > /tmp/exit-code.txt
+# Record evidence
+harshjudge evidence <runId> --step 1 --type stdout --name generate-output --data /tmp/stdout.txt
+harshjudge evidence <runId> --step 1 --type exit_code --name generate-exit --data /tmp/exit-code.txt
+```
+### Best Practices
+- Capture both stdout and stderr separately
+- Always check exit code
+- For long-running commands, use timeout
+- Save output to temp files before recording evidence

package/skills/harshjudge/references/run.md CHANGED Viewed

@@ -15,7 +15,7 @@ Use this workflow when user wants to:
 3. `harshjudge complete-step <runId>` — Complete each step, get next step
 4. `harshjudge complete-run <runId>` — Finalize with pass/fail status
-See [[run-browser]] for browser tool reference (Playwright MCP, browser-use, Chrome DevTools).
+See [[run-tools]] for tool reference by step type (frontend, backend, CLI).
 > **TOKEN OPTIMIZATION**: Each step executes in its own spawned agent. This isolates context and prevents token accumulation.
@@ -108,7 +108,7 @@ harshjudge evidence <runId> \
 Saved to: `.harshJudge/scenarios/{slug}/runs/{runId}/step-01/evidence/`
-Evidence types: `screenshot`, `console_log`, `network_log`, `html_snapshot`.
+Evidence types: `screenshot`, `console_log`, `network_log`, `html_snapshot`, `api_response`, `api_headers`, `db_snapshot`, `stdout`, `stderr`, `exit_code`, `custom`.
 ## Step Tracking (MANDATORY)

package/skills/harshjudge/references/run-browser.md DELETED Viewed

@@ -1,63 +0,0 @@
-# Browser Tool Reference
-Used during step execution in [[run]].
-HarshJudge is **browser-tool-agnostic**. Use whatever browser automation tool is available in your environment. The step agent needs these capabilities:
-## Required Capabilities
-| Action | What to do |
-|--------|-----------|
-| Navigate | Go to a URL |
-| Inspect page | Get current page state (DOM, accessibility tree) before interacting |
-| Click | Click an element by text, role, or reference |
-| Type | Enter text into an input field |
-| Select | Choose an option from a dropdown |
-| Wait | Wait for text to appear/disappear, or for a timeout |
-| Screenshot | Capture the current page as an image file |
-| Console logs | Read browser console output |
-| Network logs | Read network requests/responses |
-## Supported Browser Tools
-### Playwright MCP (Default)
-Most common. Available as a Claude Code plugin.
-```json
-{
-  "playwright": {
-    "command": "npx",
-    "args": ["@playwright/mcp@latest"]
-  }
-}
-```
-Tools: `browser_navigate`, `browser_click`, `browser_type`, `browser_snapshot`, `browser_take_screenshot`, `browser_wait_for`, `browser_console_messages`, `browser_network_requests`
-### browser-use MCP (Token Efficient Alternative)
-Compresses DOM before sending to LLM — significantly fewer tokens per interaction. Python-based.
-Setup: See [browser-use MCP docs](https://docs.browser-use.com/customize/integrations/mcp-server)
-### Chrome DevTools MCP
-Connects to an already-running Chrome instance via remote debugging.
-```json
-{
-  "chrome-devtools": {
-    "command": "npx",
-    "args": ["chrome-devtools-mcp"]
-  }
-}
-```
-## Best Practices
-- Always inspect the page before clicking or typing to get current element state
-- Take a screenshot **before** and **after** each significant action
-- Wait after navigation to confirm the page loaded
-- Capture console errors on unexpected behavior
-- Save screenshots to a temp path, then record via `harshjudge evidence`