npm - yam-harness - Versions diffs - 0.1.0 - Mend

yam-harness 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (43) hide show

package/AGENTS.md +18 -0
package/COMMANDS.md +144 -0
package/DECISIONS.md +70 -0
package/LICENSE +21 -0
package/README.md +159 -0
package/ROADMAP.md +308 -0
package/bin/yam.js +1966 -0
package/package.json +74 -0
package/references/context-reuse.md +59 -0
package/references/current-docs.md +45 -0
package/references/db-supabase-safety-lite.md +40 -0
package/references/doctor-scan.md +56 -0
package/references/eye.md +30 -0
package/references/final-report.md +61 -0
package/references/honest-completion.md +61 -0
package/references/hook-lite.md +55 -0
package/references/markdown-management.md +56 -0
package/references/memory.md +59 -0
package/references/mission.md +86 -0
package/references/question.md +25 -0
package/references/quick.md +70 -0
package/references/risk-escalation.md +27 -0
package/references/runtime-orchestration.md +57 -0
package/references/scout.md +38 -0
package/references/token-budget-reporter.md +44 -0
package/references/token-economy.md +61 -0
package/references/tool-trust-layer.md +113 -0
package/references/truth-matrix.md +44 -0
package/references/ueye.md +83 -0
package/references/ui-quality.md +23 -0
package/references/verification-levels.md +53 -0
package/skills/deep/SKILL.md +76 -0
package/skills/mission/SKILL.md +105 -0
package/skills/question/SKILL.md +45 -0
package/skills/quick/SKILL.md +81 -0
package/skills/scout/SKILL.md +71 -0
package/skills/ueye/SKILL.md +90 -0
package/templates/mission-plan.md +46 -0
package/templates/runtime-proof.md +54 -0
package/templates/tuning-log.md +39 -0
package/templates/ueye-review.md +62 -0
package/templates/yam.project.md +71 -0
package/yam.manifest.json +48 -0

package/references/question.md ADDED Viewed

@@ -0,0 +1,25 @@
+# Question
+`question` is the smallest yam route: direct Q&A without implementation, broad research, or proof loops.
+Use it when:
+- The user asks a simple explanation question.
+- The answer can be given from stable knowledge or current conversation context.
+- A local fact can be checked with one narrow file/command.
+- The user wants a quick answer, not a decision memo.
+Default behavior:
+- Answer first.
+- Keep the answer compact.
+- Do not change files.
+- Do not turn the answer into research unless freshness or uncertainty requires it.
+- If comparison, market context, external references, or third-party judgment are needed, switch to `scout`.
+Useful output shape:
+- Direct answer.
+- Why.
+- Practical next step.
+- Uncertainty, only when relevant.

package/references/quick.md ADDED Viewed

@@ -0,0 +1,70 @@
+# Quick
+`quick` is the merged small-work route: fast patching, ordinary scoped implementation, and fast error scanning.
+## Borrowed, With Weight Removed
+From Sneakoscope:
+- Honest completion language.
+- Real versus assumed verification.
+- Stop instead of claiming success when evidence is missing.
+From ECC:
+- Detect the smallest useful command.
+- Group build/type/lint/test errors by file and root cause.
+- Fix one error class at a time.
+- Re-run the same focused command after a fix.
+- Use a compact PASS/FAIL matrix.
+From Karpathy-style minimal harness:
+- Keep the instruction short enough to obey.
+- Read the smallest useful context.
+- Avoid ceremony unless it changes the result.
+## Lanes
+Patch:
+- Best for copy, CSS, labels, simple bugs, small docs, and small config changes.
+- Usually read 1-3 files.
+- Usually run 0-1 command.
+Build-fix:
+- Best when there is a concrete command failure.
+- Read the error output first.
+- Sort dependency/import/type errors before logic errors.
+- Fix one root cause before moving to the next.
+Scan:
+- Best when the user wants to know what is broken before editing.
+- Report grouped errors, likely root cause, smallest first fix, and verification command.
+- If the scan sees destructive DB/Supabase, migration, production write, or RLS/policy signals, stop quick mode and recommend `$deep`.
+## Stop Conditions
+Stop or recommend a heavier route when:
+- The same error survives three focused attempts.
+- A fix creates more errors than it removes.
+- The change requires a new dependency, architecture change, or broad refactor.
+- The surface touches auth, payments, data mutation, security, deployment, or process lifecycle.
+- The requested work needs visual reference interpretation or screenshot-led QA.
+## Verification Matrix
+Use this compact shape when commands were run:
+```text
+Verification:
+- typecheck: pass/fail/skipped
+- lint: pass/fail/skipped
+- test: pass/fail/skipped
+- build: pass/fail/skipped
+```
+Only include rows that matter to the actual run.

package/references/risk-escalation.md ADDED Viewed

@@ -0,0 +1,27 @@
+# Risk Escalation
+Escalate by recommendation, not silent mode switching.
+Recommend `$deep` when work touches:
+- DB schema or data mutation.
+- Supabase destructive commands, migrations, production writes, RLS, or policy changes.
+- Auth, payment, billing, permissions.
+- Security-sensitive code.
+- Deployment or release configuration.
+- Broad refactor or many files.
+- Failing verification.
+- Unknown ownership or unclear acceptance criteria.
+- Long-running process lifecycle.
+Recommend `$mission` only when the user has an approved plan and real subagent/team execution would materially reduce risk.
+Suggested phrase:
+```text
+This looks like a deeper route because it touches <risk>.
+I can continue lightly, or switch to $deep for stronger single-agent verification.
+Use $mission only if you want real subagent/team execution.
+```
+Use `db-supabase-safety-lite.md` for the short DB/Supabase guardrail.

package/references/runtime-orchestration.md ADDED Viewed

@@ -0,0 +1,57 @@
+# Runtime Verification
+Runtime verification now lives inside `$deep` and may be used by `$mission`.
+It combines tmux/process management, cleanup confirmation, browser QA, and proof summary when a verification claim needs runtime evidence.
+Use for:
+- Long-running dev servers.
+- Test watchers.
+- Multi-step broad tasks.
+- Browser QA sessions.
+- Work requiring before/after evidence.
+Do not use for ordinary small edits.
+## tmux Policy
+tmux is a first-class runtime verification tool in `yam`, but it is not automatic.
+Use tmux when it improves the work:
+- Long-running dev servers.
+- Test watchers.
+- Multiple concurrent panes.
+- Work where preserving logs or panes helps verification.
+- Tasks that need a stable server while browser QA runs.
+Do not use tmux just to make a small task look more verified.
+When tmux is used, record:
+- Session name.
+- Pane id when available.
+- Command running in the pane.
+- Before/during/after observation.
+- Whether the pane was left running intentionally or closed.
+Runtime proof should include:
+- Goal.
+- Steps run.
+- Commands/processes started.
+- PIDs, ports, sessions, or panes when available.
+- Checks before and after.
+- Cleanup result.
+- Truth status: proven, verified, partial, fixture_only, fixture_instrumented_real, integration_optional, real_required_missing, skipped, blocked, assumed.
+## Cleanup Rule
+Do not claim cleanup succeeded unless one of these is true:
+- Process exit was checked.
+- tmux pane/session was checked closed.
+- The process was intentionally left running and reported that way.
+For `$deep` runtime verification and `$mission`, cleanup evidence matters more than a polished final sentence.

package/references/scout.md ADDED Viewed

@@ -0,0 +1,38 @@
+# Scout
+`scout` is practical investigation and judgment, not academic research.
+Use for:
+- Current docs.
+- Library or tool comparison.
+- Design references.
+- Product direction options.
+- Risk discovery before implementation.
+- Third-party judgment before implementation.
+- Objective and subjective evaluation together.
+- Macro, realistic, and future-facing direction checks.
+Default behavior:
+- Define the question narrowly.
+- Prefer official or primary sources.
+- Gather 3-7 high-signal references by default.
+- Use current docs proof only for modern SDK/API/cloud-service behavior where stale knowledge is plausible.
+- Summarize options and recommendation.
+- Separate fact, inference, opinion, and recommendation.
+- Do not change code unless the user asks.
+Judgment format:
+- Objective judgment: evidence, constraints, cost, feasibility, and known reality.
+- Subjective judgment: taste, product instinct, design sensibility, and likely user perception.
+- Macro view: category direction and broad movement.
+- Realistic view: what can be done soon with the current project and skill level.
+- Future view: second-order effects, durability, and what may age well or badly.
+Escalate:
+- Use `question` for direct answers that do not need sources.
+- Use `deep` only when the user asks for heavy verification.
+- Use implementation routes only after the user asks to change code.

package/references/token-budget-reporter.md ADDED Viewed

@@ -0,0 +1,44 @@
+# Token Budget Reporter
+`yam` tracks token pressure without hooks or automatic surveillance.
+Use the CLI after a run when you want a quick budget check:
+```bash
+yam measure quick --files 3 --commands 1 --report-lines 5 --seconds 40
+yam measure ueye --files 7 --commands 2 --report-lines 18 --seconds 260
+```
+## Metrics
+- `files`: files or large snippets intentionally read.
+- `commands`: verification or inspection commands that materially shaped the answer.
+- `report-lines`: approximate final answer length in rendered lines.
+- `seconds`: rough elapsed working time.
+## Meaning
+`ok` means the run fit the route's intended weight.
+`over budget` means the route may still be valid, but the next run should either narrow scope or intentionally choose a heavier route.
+`no measurements` means no actuals were provided; use `yam budget <route>` for static limits.
+## Route Policy
+- `$quick`: should stay small. If it repeatedly exceeds budget, the request is probably `$deep` or `$mission`.
+- `$ueye`: may use visual/browser checks and nearby design context, but should avoid broad design archaeology unless the user asked for design-heavy work.
+- `$question`: should answer directly. If it needs sources or comparison, switch to `$scout`.
+- `$scout`: may gather a bounded set of high-signal sources and should separate evidence from judgment.
+- `$deep`: can exceed ordinary budgets, but the reason must be risk-tied; single-agent runtime/tmux/browser checks belong here when verification needs them.
+- `$mission`: can spend more context on real subagent/team lanes, cross-verification, doctor scan, and runtime evidence, but only for approved plans where real subagents are used or explicitly unavailable/partial.
+## Compared Baseline
+Sneakoscope would favor stronger automatic evidence collection.
+ECC would favor selective, low-context reporting.
+Karpathy-style minimal harness would remove the measurement unless it changes behavior.
+`yam` keeps manual measurement because it helps reduce over-reading without installing hooks.

package/references/token-economy.md ADDED Viewed

@@ -0,0 +1,61 @@
+# Token Economy
+Token economy is part of quality.
+Default behavior:
+- Read `yam.project.md` first when present.
+- Start with the narrowest likely file set.
+- Prefer `rg`, file names, and local snippets over broad reading.
+- Read nearby patterns before reading whole modules.
+- Summarize large files instead of loading full context when possible.
+- Stop searching once the edit surface is clear.
+- Do not open long docs or generated files unless they are directly needed.
+- Treat `yam.project.md` as the reusable context source; inspect it with `yam pack` when it seems stale.
+- Do not produce long reports for small work.
+- Use `yam measure <route>` after real runs when budget drift needs tuning.
+Suggested read budgets:
+```text
+quick
+- Start with 1 to 3 files for patch work.
+- Allow a small module or error surface only when the first hypothesis fails or the scan points wider.
+- Prefer one focused command, two at most.
+ueye
+- Project direction, source screen, target component, nearby styles/tokens.
+- Avoid full design-system archaeology for simple tweaks.
+- Visual claims require source-screen evidence or a partial truth cap.
+question
+- 0 to 2 files.
+- Prefer current conversation context and stable knowledge.
+- Switch to scout when evidence, source freshness, or comparison matters.
+scout
+- 3 to 7 high-signal sources.
+- Prefer official docs and primary sources.
+deep
+- Wider single-agent reading is allowed, but explain why.
+- Runtime/tmux/browser context should be tied to verification evidence.
+mission
+- Wider reading is allowed only for approved real subagent/team execution.
+- Runtime/tmux/browser context should be tied to mission evidence.
+```
+Measured budgets:
+```bash
+yam measure quick --files 3 --commands 1 --report-lines 5 --seconds 40
+yam measure deep --files 18 --commands 4 --report-lines 32 --seconds 900
+```
+Escalate reading only when:
+- The first edit surface is wrong.
+- Tests or runtime output contradict the hypothesis.
+- Risk surface is broader than expected.
+- The user asks for deeper analysis.

package/references/tool-trust-layer.md ADDED Viewed

@@ -0,0 +1,113 @@
+# Tool Trust Layer
+`yam` should grow a proof-first trust layer, but not a proof-first cage.
+The goal is to help a beginner start small and then move toward professional implementation with evidence, without making every step feel like a release gate.
+This is not a lightweight-only direction. `yam-lite` is light; `yam` is progressive.
+## Depth Ladder
+Every route carries a basic contract:
+- respect project direction before changing code
+- use context packs or memory before broad re-reading
+- verify only what was actually checked
+- report remaining tasks and fix-first issues when useful
+Then depth increases by route:
+- `$question`: answer without unnecessary research.
+- `$scout`: investigate and judge without implementation by default.
+- `$quick`: implement quickly with focused verification.
+- `$ueye`: implement/review UI with real visual evidence when feasible.
+- `$deep`: prove risky single-agent work with stronger runtime/tool evidence.
+- `$mission`: coordinate real subagents/team lanes with cross-verification.
+## Surfaces To Wrap
+- Codex CLI/App readiness.
+- Browser and Computer Use availability.
+- tmux and long-running process lifecycle.
+- Subagent/team availability for `$mission`.
+- imagegen or screenshot evidence for `$ueye`.
+- Context7 or official docs for current package/API questions.
+- codex-lb or provider routing health when configured.
+- DB/Supabase safety for destructive or migration work.
+- Research, QA, wiki/context, proof, and release-gate workflows.
+## yam Policy
+Default:
+- No automatic trust gate.
+- No automatic proof loop.
+- No automatic tmux.
+- No automatic subagents.
+- No automatic external provider setup.
+- Basic direction fit and honest verification boundaries still apply.
+Advisory:
+- `yam-lite` hook may suggest routes and warn about overclaiming.
+- `yam pack` may warn about stale project direction, command drift, active hooks, or Sneakoscope surfaces.
+On demand:
+- `$quick`: smallest honest local verification.
+- `$ueye`: visual evidence and truth caps.
+- `$deep`: single-agent heavy proof, runtime/tmux/browser/process cleanup.
+- `$mission`: real subagent/team execution with cross-verification.
+- `yam tools doctor`: inspect tool readiness without changing project state.
+- `yam proof`: summarize actual evidence without running verification.
+## Borrow From Sneakoscope
+- Tool readiness checks.
+- Hook status and trust reporting.
+- DB/Supabase destructive-operation gate thinking.
+- Computer Use and screenshot evidence caps.
+- tmux/process cleanup truth.
+- Source-intelligence proof for current docs.
+- Destructive DB/Supabase command detection and production-write caution.
+- Feature/release inventory as an optional doctor, not a default gate.
+## Borrow From ECC
+- Selective install and profiles.
+- Evidence boundaries.
+- Low-context command detection.
+- Optional orchestration instead of always-on orchestration.
+## Borrow From Open Design
+- Real preview/screenshot evidence.
+- Compact design direction.
+- P0 visual gates for design-heavy work.
+## Reject
+- Always-on Team route.
+- Always-on hook enforcement.
+- Release gates for normal coding.
+- Mandatory generated image review for every UI task.
+- Broad MCP/tool scans before small work.
+- Provider config mutation without explicit user approval.
+## Implemented Shape
+`yam tools doctor` should be read-only by default:
+- report installed/available tools
+- report package scripts
+- report hook conflicts
+- report known high-risk surfaces
+- suggest route and proof level
+- do not install, configure, or mutate
+`yam proof` should summarize what actually ran:
+- command evidence
+- browser/screenshot evidence
+- runtime/tmux/process cleanup evidence
+- truth status
+- skipped/blocked/assumed items

package/references/truth-matrix.md ADDED Viewed

@@ -0,0 +1,44 @@
+# Truth Matrix
+Use precise verification language.
+```text
+verified
+Actual command, browser check, screenshot, or test passed.
+proven
+Runtime-specific: physical or process evidence supports the claim, such as a live server response, browser check, tmux pane evidence, or verified process exit.
+partial
+Some meaningful checks passed, but not the full surface.
+fixture_only
+Fixture or mock evidence exists, but it cannot support a real runtime claim.
+fixture_instrumented_real
+A real backend or process was involved, but the run was shaped by fixture instrumentation. Do not report it as fully proven.
+integration_optional
+A live integration/runtime check was optional and not requested. This is not a failure, but it is not verified.
+real_required_missing
+The user or route required live runtime evidence, but that evidence was unavailable.
+skipped
+Verification intentionally skipped because the change is tiny or not runnable.
+blocked
+Verification could not run because of environment, tool, auth, network, or time limits.
+assumed
+Reasonable inference from code reading, but no execution evidence.
+```
+Never say `verified` when the correct status is `partial`, `skipped`, `blocked`, or `assumed`.
+Never say `proven` for runtime work unless the runtime evidence exists.
+Never say full visual `verified` when the only evidence is a reference image, generated image, code reading, or text-only critique.
+For ordinary routes, prefer `verified`, `partial`, `skipped`, `blocked`, or `assumed`.
+For `$deep` runtime verification and `$mission`, use the more precise runtime statuses when they prevent overclaiming.

package/references/ueye.md ADDED Viewed

@@ -0,0 +1,83 @@
+# Ueye
+`ueye` is the merged UI/design route: design-heavy implementation, screenshot-led UX review, and visual QA.
+## Borrowed, With Weight Removed
+From Sneakoscope image UX review:
+- Source-screen inventory before visual claims.
+- P0-P3 issue ledger.
+- Fix P0/P1 first, then cheap local P2 issues.
+- Recheck changed or high-risk screens after fixes when feasible.
+- Cap text-only or missing-screenshot reviews as partial instead of fully verified.
+Kept out from Sneakoscope by design:
+- Mandatory generated annotated images.
+- Image voxel ledgers.
+- Release gates for every UI change.
+- Always-on proof loops.
+From Open Design:
+- Real examples and previews matter more than abstract prose.
+- Design direction should be compact and searchable.
+- P0 gates should reject placeholder visuals, generic UI, and broken responsive states.
+- UI work should be self-contained enough to inspect.
+From ECC:
+- Separate evidence from judgment.
+- Keep review output compact.
+- Escalate only when the risk justifies it.
+## Source-Screen Inventory
+Acceptable sources:
+- User-provided screenshot.
+- Browser screenshot from a local or remote URL.
+- Exported static artifact image.
+- Current screen when the browser tool can inspect it.
+- Reference image, as direction evidence only.
+- Generated annotated image, as derivative review aid only.
+Record:
+- Source type.
+- Screen or URL.
+- State: default, loading, error, empty, disabled, hover/focus, mobile, or unknown.
+- Whether visual verification is full, partial, skipped, or blocked.
+Bound:
+- Inspect 1-3 primary images by default.
+- Keep the source-screen inventory to the 5 most important rows by default.
+- Generated callout images are optional and should usually be at most one per review pass.
+- P0/P1 issues can expand the review; P2/P3 should stay top-few unless the user asks for a full polish pass.
+## P0-P3 Ledger
+- P0: blocker, unusable, impossible to complete primary workflow, severe accessibility or responsive failure.
+- P1: major issue that strongly harms conversion, comprehension, trust, or task completion.
+- P2: noticeable quality issue: spacing, alignment, hierarchy, contrast, density, state polish.
+- P3: optional polish.
+## Implementation Loop
+1. Direction fit.
+2. Source-screen inventory.
+3. Nearby pattern and token scan.
+4. Implementation.
+5. Screenshot/browser recheck when feasible.
+6. P0/P1 closeout.
+7. Truth status.
+## Truth Caps
+- Full visual verification: real source-screen evidence and relevant recheck.
+- Partial: text-only review, no screenshot, browser unavailable, or source-state gaps.
+- Blocked: requested visual claim cannot be inspected.
+- Assumed: implementation follows code patterns but visual result was not observed.
+- Reference-only or generated-only evidence cannot upgrade a result to full visual verification.

package/references/ui-quality.md ADDED Viewed

@@ -0,0 +1,23 @@
+# UI Quality
+Direction comes before decoration.
+Check:
+- What real source-screen evidence supports the visual claim?
+- Does the change match the product domain?
+- Does it fit existing components, tokens, spacing, typography, and interaction patterns?
+- Is the primary workflow easier to scan or complete?
+- Does text fit on mobile and desktop?
+- Are states covered when relevant: default, loading, error, empty, disabled, hover/focus, mobile?
+- Is color contrast sufficient?
+- Is the UI less generic after the change?
+Avoid:
+- Generic purple gradients.
+- Decorative blobs.
+- Cards inside cards.
+- Marketing layouts for operational tools.
+- New dependencies for small visual flourishes.
+- Treating reference/generated images as proof of the implemented screen.

package/references/verification-levels.md ADDED Viewed

@@ -0,0 +1,53 @@
+# Verification Levels
+## Level 0: Read/Inspect
+For copy, spacing, small CSS, docs, or no-runtime changes.
+Expected:
+- Re-read the changed file or relevant snippet.
+- Check local project fit.
+- Report that no command was run.
+## Level 1: One Relevant Check
+For small JS/TS/component changes.
+Examples:
+- One focused test.
+- Typecheck.
+- Lint for touched area.
+## Level 2: Focused Confidence
+For feature-sized work.
+Examples:
+- Typecheck plus related test.
+- Build if no narrower test exists.
+## Level 3: UI Confidence
+For visible UI changes.
+Expected when feasible:
+- Desktop screenshot or browser check.
+- Mobile viewport check.
+- Check overflow, alignment, contrast, density, and core states.
+## Level 4: Deep Confidence
+For auth, payment, DB, migration, release, broad refactor, or user-requested deep verification.
+Examples:
+- Build.
+- Test suite or relevant suites.
+- Browser QA.
+- Security check.
+- Proof summary.