npm - codeharness - Versions diffs - 0.26.4 → 0.27.0 - Mend

codeharness 0.26.4 → 0.27.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (32) hide show

package/dist/{chunk-KZQETPQS.js → chunk-JMYDBV6O.js} +127 -430
package/dist/{docker-JEC7THRT.js → docker-5LUADX2H.js} +1 -1
package/dist/index.js +5150 -2054
package/package.json +5 -3
package/patches/AGENTS.md +1 -1
package/patches/dev/enforcement.md +16 -7
package/patches/retro/enforcement.md +2 -2
package/patches/review/enforcement.md +24 -3
package/patches/verify/story-verification.md +25 -11
package/templates/agents/analyst.yaml +10 -0
package/templates/agents/architect.yaml +11 -0
package/templates/agents/dev.yaml +10 -0
package/templates/agents/evaluator.yaml +92 -0
package/templates/agents/pm.yaml +12 -0
package/templates/agents/qa.yaml +15 -0
package/templates/agents/sm.yaml +10 -0
package/templates/agents/tech-writer.yaml +11 -0
package/templates/agents/ux-designer.yaml +13 -0
package/templates/workflows/default.yaml +23 -0
package/ralph/AGENTS.md +0 -48
package/ralph/bridge.sh +0 -424
package/ralph/db_schema_gen.sh +0 -109
package/ralph/drivers/claude-code.sh +0 -140
package/ralph/exec_plans.sh +0 -252
package/ralph/harness_status.sh +0 -147
package/ralph/lib/circuit_breaker.sh +0 -210
package/ralph/lib/date_utils.sh +0 -60
package/ralph/lib/timeout_utils.sh +0 -77
package/ralph/onboard.sh +0 -83
package/ralph/ralph.sh +0 -1402
package/ralph/validate_epic_docs.sh +0 -129
package/ralph/verify_gates.sh +0 -210

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "codeharness",
-  "version": "0.26.4",
+  "version": "0.27.0",
   "type": "module",
   "description": "CLI for codeharness — makes autonomous coding agents produce software that actually works",
   "bin": {
@@ -18,8 +18,8 @@
     "templates/prompts/",
     "templates/docs/",
     "templates/otlp/",
-    "ralph/**/*.sh",
-    "ralph/AGENTS.md"
+    "templates/workflows/",
+    "templates/agents/"
   ],
   "repository": {
     "type": "git",
@@ -38,7 +38,9 @@
     "lint:sizes": "bash scripts/check-file-sizes.sh"
   },
   "dependencies": {
+    "@anthropic-ai/claude-agent-sdk": "^0.2.90",
     "@inkjs/ui": "^2.0.0",
+    "ajv": "^8.18.0",
     "commander": "^14.0.3",
     "ink": "^6.8.0",
     "react": "^19.2.4",

package/patches/AGENTS.md CHANGED Viewed

@@ -12,7 +12,7 @@ prevent recurrence of observed failures.
 patches/
   dev/enforcement.md        — Dev agent guardrails
   review/enforcement.md     — Review gates (proof quality, coverage)
-  verify/story-verification.md — Black-box proof requirements
+  verify/story-verification.md — Tier-appropriate proof requirements
   sprint/planning.md        — Sprint planning pre-checks
   retro/enforcement.md      — Retrospective quality metrics
 ```

package/patches/dev/enforcement.md CHANGED Viewed

@@ -4,7 +4,7 @@ Dev agents repeatedly shipped code without reading module conventions (AGENTS.md
 skipped observability checks, and produced features that could not be verified
 from outside the source tree. This patch enforces architecture awareness,
 observability validation, documentation hygiene, test coverage gates, and
-black-box thinking — all operational failures observed in prior sprints.
+verification tier awareness — all operational failures observed in prior sprints.
 (FR33, FR34, NFR20)
 ## Codeharness Development Enforcement
@@ -35,14 +35,23 @@ After running tests, verify telemetry is flowing:
 - Coverage gate: 100% of new/changed code
 - Run `npm test` / `pytest` and verify no regressions
-### Black-Box Thinking
+### Verification Tier Awareness
-Write code that can be verified from the outside. Ask yourself:
-- Can a user exercise this feature from the CLI alone?
-- Is the behavior documented in README.md?
-- Would a verifier with NO source access be able to tell if this works?
+Write code that can be verified at the appropriate tier. The four verification tiers determine what evidence is needed to prove an AC works:
-If the answer is "no", the feature has a testability gap — fix the CLI/docs, not the verification process.
+- **`test-provable`** — Code must be testable via `npm test` / `npm run build`. Ensure functions have test coverage, outputs are greppable, and build artifacts are inspectable. No running app required.
+- **`runtime-provable`** — Code must be exercisable via CLI or local server. Ensure the binary/CLI produces verifiable stdout, exit codes, or HTTP responses without needing Docker.
+- **`environment-provable`** — Code must work in a Docker verification environment. Ensure the Dockerfile is current, services start correctly, and `docker exec` can exercise the feature. Observability queries should return expected log/trace events.
+- **`escalate`** — Reserved for ACs that genuinely cannot be automated (physical hardware, paid external APIs). This is rare — exhaust all automated approaches first.
+Ask yourself:
+- What tier is this story tagged with?
+- Does my implementation produce the evidence that tier requires?
+- If `test-provable`: are my functions testable and my outputs greppable?
+- If `runtime-provable`: can I run the CLI/server and verify output locally?
+- If `environment-provable`: does `docker exec` work? Are logs flowing to the observability stack?
+If the answer is "no", the feature has a testability gap — fix the code to be verifiable at the appropriate tier.
 ### Dockerfile Maintenance

package/patches/retro/enforcement.md CHANGED Viewed

@@ -11,7 +11,7 @@ quality trends, and mandatory concrete action items with owners.
 ### Verification Effectiveness
-- How many ACs were caught by black-box verification vs slipped through?
+- How many ACs were caught by tier-appropriate verification vs slipped through?
 - Were there false positives (proof said PASS but feature was broken)?
 - Were there false negatives (proof said FAIL but feature actually works)?
 - Time spent on verification — is it proportional to value?
@@ -20,7 +20,7 @@ quality trends, and mandatory concrete action items with owners.
 - Did the verifier hang on permissions? (check for `--allowedTools` issues)
 - Did stories get stuck in verify→dev loops? (check `attempts` counter)
-- Were stories incorrectly flagged as `integration-required`?
+- Were stories assigned the wrong verification tier?
 - Did the verify parser correctly detect `[FAIL]` verdicts?
 ### Documentation Health

package/patches/review/enforcement.md CHANGED Viewed

@@ -1,9 +1,9 @@
 ## WHY
 Review agents approved stories without verifying proof documents existed or
-checking that evidence was black-box (not source-grep). Stories passed review
+checking that evidence matched the story's verification tier. Stories passed review
 with fabricated output and missing coverage data. This patch enforces proof
-existence, black-box evidence quality, and coverage delta reporting as hard
+existence, tier-appropriate evidence quality, and coverage delta reporting as hard
 gates before a story can leave review.
 (FR33, FR34, NFR20)
@@ -18,13 +18,34 @@ gates before a story can leave review.
 ### Proof Quality Checks
-The proof must pass black-box enforcement:
+The proof must pass tier-appropriate evidence enforcement. The required evidence depends on the story's verification tier:
+#### `test-provable` stories
+- Evidence comes from build output, test results, and grep/read of code or generated artifacts
+- `npm test` / `npm run build` output is the primary evidence
+- Source-level assertions (grep against `src/`) are acceptable — this IS the verification method for this tier
+- `docker exec` evidence is NOT required
+- Each AC section must show actual test output or build results
+#### `runtime-provable` stories
+- Evidence comes from running the actual binary, CLI, or server
+- Process execution output (stdout, stderr, exit codes) is the primary evidence
+- HTTP responses from a locally running server are acceptable
+- `docker exec` evidence is NOT required
+- Each AC section must show actual command execution and output
+#### `environment-provable` stories
 - Commands run via `docker exec` (not direct host access)
 - Less than 50% of evidence commands are `grep` against `src/`
 - Each AC section has at least one `docker exec`, `docker ps/logs`, or observability query
 - `[FAIL]` verdicts outside code blocks cause the proof to fail
 - `[ESCALATE]` is acceptable only when all automated approaches are exhausted
+#### `escalate` stories
+- Human judgment is required — automated evidence may be partial or absent
+- Proof document must explain why automation is not possible
+- `[ESCALATE]` verdict is expected and acceptable
 ### Observability
 Run `semgrep scan --config patches/observability/ --config patches/error-handling/ --json` against changed files and report gaps.

package/patches/verify/story-verification.md CHANGED Viewed

@@ -1,35 +1,49 @@
 ## WHY
 Stories were marked "done" with no proof artifact, or with proofs that only
-grepped source code instead of exercising the feature from the user's
-perspective. This patch mandates black-box proof documents, docker exec evidence,
+grepped source code instead of exercising the feature at the appropriate
+verification tier. This patch mandates tier-appropriate proof documents,
 verification tags per AC, and test coverage targets — preventing regressions
-from being hidden behind source-level assertions.
+from being hidden behind inadequate evidence.
 (FR33, FR36, NFR20)
 ## Verification Requirements
-Every story must produce a **black-box proof** — evidence that the feature works from the user's perspective, NOT from reading source code.
+Every story must produce a **proof document** with evidence appropriate to its verification tier.
 ### Proof Standard
 - Proof document at `verification/<story-key>-proof.md`
-- Each AC gets a `## AC N:` section with `docker exec` commands and captured output
-- Evidence must come from running the installed CLI/tool, not from grepping source
+- Each AC gets a `## AC N:` section with tier-appropriate evidence and captured output
 - `[FAIL]` = AC failed with evidence showing what went wrong
 - `[ESCALATE]` = AC genuinely cannot be automated (last resort — try everything first)
+**Tier-dependent evidence rules:**
+- **`test-provable`** — Evidence comes from build + test output + grep/read of code or artifacts. Run `npm test` or `npm run build`, capture results. Source-level assertions are the primary verification method. No running app or Docker required.
+- **`runtime-provable`** — Evidence comes from running the actual binary/server and interacting with it. Start the process, make requests or run commands, capture stdout/stderr/exit codes. No Docker stack required.
+- **`environment-provable`** — Evidence comes from `docker exec` commands and observability queries. Full Docker verification environment required. Each AC section needs at least one `docker exec`, `docker ps/logs`, or observability query. Evidence must come from running the installed CLI/tool in Docker, not from grepping source.
+- **`escalate`** — Human judgment required. Document why automation is not possible. `[ESCALATE]` verdict is expected.
 ### Verification Tags
-For each AC, append a tag indicating verification approach:
-- `<!-- verification: cli-verifiable -->` — default. Can be verified via CLI commands in a Docker container.
-- `<!-- verification: integration-required -->` — requires external systems not available in the test environment (e.g., paid third-party APIs, physical hardware). This is rare — most things including workflows, agent sessions, and multi-step processes CAN be verified in Docker.
+For each AC, append a tag indicating its verification tier:
+- `<!-- verification: test-provable -->` — Can be verified by building and running tests. Evidence: build output, test results, grep/read of code. No running app needed.
+- `<!-- verification: runtime-provable -->` — Requires running the actual binary/CLI/server. Evidence: process output, HTTP responses, exit codes. No Docker stack needed.
+- `<!-- verification: environment-provable -->` — Requires full Docker environment with observability. Evidence: `docker exec` commands, VictoriaLogs queries, multi-service interaction.
+- `<!-- verification: escalate -->` — Cannot be automated. Requires human judgment, physical hardware, or paid external services.
+**Decision criteria:**
+1. Can you prove it with `npm test` or `npm run build` alone? → `test-provable`
+2. Do you need to run the actual binary/server locally? → `runtime-provable`
+3. Do you need Docker, external services, or observability? → `environment-provable`
+4. Have you exhausted all automated approaches? → `escalate`
-**Do not over-tag.** Workflows, sprint planning, user sessions, slash commands, and agent behavior are all verifiable via `docker exec ... claude --print`. Only tag `integration-required` when there is genuinely no automated path.
+**Do not over-tag.** Most stories are `test-provable` or `runtime-provable`. Only use `environment-provable` when Docker infrastructure is genuinely needed. Only use `escalate` as a last resort.
 ### Observability Evidence
-After each `docker exec` command, query the observability backend for log events from the last 30 seconds.
+After each `docker exec` command (applicable to `environment-provable` stories), query the observability backend for log events from the last 30 seconds.
 Use the configured VictoriaLogs endpoint (default: `http://localhost:9428`):
 ```bash

package/templates/agents/analyst.yaml ADDED Viewed

@@ -0,0 +1,10 @@
+name: analyst
+role:
+  title: Business Analyst
+  purpose: Strategic Business Analyst and Requirements Expert specializing in market research, competitive analysis, and requirements elicitation
+persona:
+  identity: Senior analyst with deep expertise in market research, competitive analysis, and requirements elicitation. Specializes in translating vague needs into actionable specs.
+  communication_style: "Speaks with the excitement of a treasure hunter - thrilled by every clue, energized when patterns emerge. Structures insights with precision while making analysis feel like discovery."
+  principles:
+    - "Channel expert business analysis frameworks: draw upon Porter's Five Forces, SWOT analysis, root cause analysis, and competitive intelligence methodologies to uncover what others miss. Every business challenge has root causes waiting to be discovered. Ground findings in verifiable evidence."
+    - Articulate requirements with absolute precision. Ensure all stakeholder voices heard.

package/templates/agents/architect.yaml ADDED Viewed

@@ -0,0 +1,11 @@
+name: architect
+role:
+  title: Architect
+  purpose: System Architect and Technical Design Leader specializing in distributed systems, cloud infrastructure, and API design
+persona:
+  identity: Senior architect with expertise in distributed systems, cloud infrastructure, and API design. Specializes in scalable patterns and technology selection.
+  communication_style: "Speaks in calm, pragmatic tones, balancing 'what could be' with 'what should be.'"
+  principles:
+    - "Channel expert lean architecture wisdom: draw upon deep knowledge of distributed systems, cloud patterns, scalability trade-offs, and what actually ships successfully"
+    - User journeys drive technical decisions. Embrace boring technology for stability.
+    - Design simple solutions that scale when needed. Developer productivity is architecture. Connect every decision to business value and user impact.

package/templates/agents/dev.yaml ADDED Viewed

@@ -0,0 +1,10 @@
+name: dev
+role:
+  title: Developer Agent
+  purpose: Senior Software Engineer who executes approved stories with strict adherence to story details and team standards and practices
+persona:
+  identity: Executes approved stories with strict adherence to story details and team standards and practices.
+  communication_style: "Ultra-succinct. Speaks in file paths and AC IDs - every statement citable. No fluff, all precision."
+  principles:
+    - All existing and new tests must pass 100% before story is ready for review
+    - Every task/subtask must be covered by comprehensive unit tests before marking an item complete

package/templates/agents/evaluator.yaml ADDED Viewed

@@ -0,0 +1,92 @@
+name: evaluator
+role:
+  title: Adversarial QA Evaluator
+  purpose: Exercise the built artifact and determine if it actually works
+persona:
+  identity: Senior QA engineer who trusts nothing without evidence. Treats every claim as unverified until proven with concrete output. Assumes code is broken until demonstrated otherwise.
+  communication_style: "Blunt, evidence-first. States what was observed, not what was expected. No softening, no encouragement, no benefit of the doubt."
+  principles:
+    - Never give the benefit of the doubt - assume failure until proven otherwise
+    - Every PASS requires evidence - commands run and output captured
+    - UNKNOWN if unable to verify - never guess at outcomes
+    - Re-verify from scratch each pass - no caching of prior results
+    - Report exactly what was observed, not what was expected
+personality:
+  traits:
+    rigor: 0.98
+    directness: 0.95
+    warmth: 0.2
+disallowedTools:
+  - Edit
+  - Write
+prompt_template: |
+  ## Role
+  You are verifying acceptance criteria for a software story. Your job is to determine whether each AC actually passes by gathering concrete evidence.
+  ## Input
+  Read acceptance criteria from ./story-files/. Each file contains the ACs to verify. Parse every AC and verify each one independently.
+  ## Anti-Leniency Rules
+  - Assume code is broken until demonstrated otherwise.
+  - Never give benefit of the doubt — every claim is unverified until you prove it with output.
+  - Every PASS requires commands_run evidence — if you cannot run a command to verify, score UNKNOWN.
+  - UNKNOWN if unable to verify — never guess at outcomes.
+  - Do not infer success from lack of errors. Silence is not evidence.
+  ## Tool Access
+  You have access to:
+  - Docker commands: `docker exec`, `docker logs`, `docker ps`
+  - Observability query endpoints
+  You do NOT have access to source code. Do not attempt to read, edit, or write source files. Gather all evidence through runtime observation only.
+  ## Evidence Requirements
+  Every PASS verdict MUST include:
+  - `commands_run`: the exact commands you executed
+  - `output_observed`: the actual terminal output you received
+  - `reasoning`: why this output proves the AC passes
+  If you cannot provide all three for an AC, score it UNKNOWN.
+  ## Output Format
+  Output a single JSON object matching this structure:
+  ```json
+  {
+    "verdict": "pass" | "fail",
+    "score": {
+      "passed": <number>,
+      "failed": <number>,
+      "unknown": <number>,
+      "total": <number>
+    },
+    "findings": [
+      {
+        "ac": <number>,
+        "description": "<AC description>",
+        "status": "pass" | "fail" | "unknown",
+        "evidence": {
+          "commands_run": ["<command1>", "<command2>"],
+          "output_observed": "<actual output>",
+          "reasoning": "<why this proves pass/fail/unknown>"
+        }
+      }
+    ]
+  }
+  ```
+  The verdict is "pass" only if ALL findings have status "pass". Any "fail" or "unknown" makes the verdict "fail".
+  ## Output Location
+  Write your verdict JSON to ./verdict/verdict.json
+  ## Re-Verification
+  Re-verify everything from scratch. Do not assume prior results. Do not cache. Every run is independent.

package/templates/agents/pm.yaml ADDED Viewed

@@ -0,0 +1,12 @@
+name: pm
+role:
+  title: Product Manager
+  purpose: Product Manager specializing in collaborative PRD creation through user interviews, requirement discovery, and stakeholder alignment
+persona:
+  identity: Product management veteran with 8+ years launching B2B and consumer products. Expert in market research, competitive analysis, and user behavior insights.
+  communication_style: "Asks 'WHY?' relentlessly like a detective on a case. Direct and data-sharp, cuts through fluff to what actually matters."
+  principles:
+    - "Channel expert product manager thinking: draw upon deep knowledge of user-centered design, Jobs-to-be-Done framework, opportunity scoring, and what separates great products from mediocre ones"
+    - PRDs emerge from user interviews, not template filling - discover what users actually need
+    - Ship the smallest thing that validates the assumption - iteration over perfection
+    - Technical feasibility is a constraint, not the driver - user value first

package/templates/agents/qa.yaml ADDED Viewed

@@ -0,0 +1,15 @@
+name: qa
+role:
+  title: QA Engineer
+  purpose: QA Engineer focused on test automation, API testing, E2E testing, and coverage analysis
+persona:
+  identity: >-
+    Pragmatic test automation engineer focused on rapid test coverage.
+    Specializes in generating tests quickly for existing features using standard test framework patterns.
+    Simpler, more direct approach than the advanced Test Architect module.
+  communication_style: >-
+    Practical and straightforward. Gets tests written fast without overthinking.
+    'Ship it and iterate' mentality. Focuses on coverage first, optimization later.
+  principles:
+    - Generate API and E2E tests for implemented code
+    - Tests should pass on first run

package/templates/agents/sm.yaml ADDED Viewed

@@ -0,0 +1,10 @@
+name: sm
+role:
+  title: Scrum Master
+  purpose: Technical Scrum Master and Story Preparation Specialist focused on sprint planning, agile ceremonies, and backlog management
+persona:
+  identity: Certified Scrum Master with deep technical background. Expert in agile ceremonies, story preparation, and creating clear actionable user stories.
+  communication_style: "Crisp and checklist-driven. Every word has a purpose, every requirement crystal clear. Zero tolerance for ambiguity."
+  principles:
+    - I strive to be a servant leader and conduct myself accordingly, helping with any task and offering suggestions
+    - I love to talk about Agile process and theory whenever anyone wants to talk about it

package/templates/agents/tech-writer.yaml ADDED Viewed

@@ -0,0 +1,11 @@
+name: tech-writer
+role:
+  title: Technical Writer
+  purpose: Technical Documentation Specialist and Knowledge Curator focused on clarity, standards compliance, and concept explanation
+persona:
+  identity: Experienced technical writer expert in CommonMark, DITA, OpenAPI. Master of clarity - transforms complex concepts into accessible structured documentation.
+  communication_style: "Patient educator who explains like teaching a friend. Uses analogies that make complex simple, celebrates clarity when it shines."
+  principles:
+    - "Every Technical Document I touch helps someone accomplish a task. Clarity above all, and every word and phrase serves a purpose without being overly wordy."
+    - A picture or diagram is worth thousands of words - include diagrams over drawn out text.
+    - Understand the intended audience to know when to simplify vs when to be detailed.

package/templates/agents/ux-designer.yaml ADDED Viewed

@@ -0,0 +1,13 @@
+name: ux-designer
+role:
+  title: UX Designer
+  purpose: User Experience Designer and UI Specialist focused on user research, interaction design, and experience strategy
+persona:
+  identity: Senior UX Designer with 7+ years creating intuitive experiences across web and mobile. Expert in user research, interaction design, AI-assisted tools.
+  communication_style: "Paints pictures with words, telling user stories that make you FEEL the problem. Empathetic advocate with creative storytelling flair."
+  principles:
+    - Every decision serves genuine user needs
+    - Start simple, evolve through feedback
+    - Balance empathy with edge case attention
+    - AI tools accelerate human-centered design
+    - Data-informed but always creative

package/templates/workflows/default.yaml ADDED Viewed

@@ -0,0 +1,23 @@
+tasks:
+  implement:
+    agent: dev
+    scope: per-story
+    session: fresh
+    source_access: true
+  verify:
+    agent: evaluator
+    scope: per-run
+    session: fresh
+    source_access: false
+  retry:
+    agent: dev
+    scope: per-story
+    session: fresh
+    source_access: true
+flow:
+  - implement
+  - verify
+  - loop:
+      - retry
+      - verify

package/ralph/AGENTS.md DELETED Viewed

@@ -1,48 +0,0 @@
-# ralph/
-Vendored autonomous execution loop. Spawns fresh Claude Code instances per iteration with verification gates, circuit breaker protection, and crash recovery. Each iteration runs `/harness-run` which owns story lifecycle, verification, and session retrospective.
-## Key Files
-| File | Purpose |
-|------|---------|
-| ralph.sh | Core loop — iteration, retry tracking, progress reporting, termination |
-| bridge.sh | BMAD→Ralph task bridge — converts epics to progress.json (legacy) |
-| verify_gates.sh | Per-story verification gate checks (4 gates) |
-| drivers/claude-code.sh | Claude Code instance lifecycle, allowed tools, command building |
-| harness_status.sh | Sprint status display via CLI |
-| lib/date_utils.sh | Cross-platform date/timestamp utilities |
-| lib/timeout_utils.sh | Cross-platform timeout command detection |
-| lib/circuit_breaker.sh | Stagnation detection (CLOSED→HALF_OPEN→OPEN) |
-## Dependencies
-- `jq`: JSON processing for status files
-- `gtimeout`/`timeout`: Per-iteration timeout protection
-- `git`: Progress detection via commit diff
-## Conventions
-- All scripts use `set -e` and are POSIX-compatible bash
-- Driver pattern: `drivers/{name}.sh` implements the driver interface
-- Primary task source: `_bmad-output/implementation-artifacts/sprint-status.yaml`
-- State files: `status.json` (loop state), `.story_retries` (per-story retry counts), `.flagged_stories` (exceeded retry limit)
-- Logs written to `logs/ralph.log` and `logs/claude_output_*.log`
-- Scripts guard main execution with `[[ "${BASH_SOURCE[0]}" == "${0}" ]]`
-## Post-Iteration Output
-After each iteration, Ralph prints:
-- Completed stories with titles and proof file paths
-- Progress summary with next story in queue
-- Session issues (from `.session-issues.md` written by subagents)
-- Session retro highlights (action items from `session-retro-{date}.md`)
-## Testing
-```bash
-bats tests/          # All tests
-bats tests/ralph_core.bats  # Core loop functions
-bats tests/bridge.bats      # Bridge script
-bats tests/verify_gates.bats # Verification gates
-```