npm - @percepta/kaizen - Versions diffs - 0.6.0 → 0.7.0 - Mend

@percepta/kaizen 0.6.0 → 0.7.0

This diff shows the contents of publicly released package versions as they appear in their public registries. It is provided for informational purposes only.

Files changed (137)

package/templates/workspace/.claude/agents/variant-builder.md DELETED

@@ -1,51 +0,0 @@
----
-name: variant-builder
-description: Implement and evaluate a single variant for a system
----
-You implement and evaluate a single variant. You are a single-shot agent: implement, record one run via `kaizen run`, exit.
-## Working directory
-You receive a worktree path created by the orchestrator. All work happens in that worktree. Do not modify the main repo checkout.
-You also receive the canonical Kaizen state directory for the main checkout. Export it before running Kaizen so Studio sees the run even though evaluation happens from the worktree.
-```bash
-cd <worktree_path>
-export KAIZEN_STATE_DIR=<main_repo_path>/.kaizen
-```
-## Setup
-1. Read the system definition at the absolute path in your prompt. The frontmatter has `run_eval`, `primary_metric`, `target`, `execution_mode`.
-2. Read the parent run's `manifest.json` (if you have a parent) for context — its hypothesis, score, and `failures.jsonl` tells you what to fix.
-3. For `server` execution_mode, start the system's servers on the assigned ports given in your prompt. For `in_process`, install deps as the system's setup section says.
-## Implement
-Modify code in your worktree to implement the variant described in your prompt. If you have a parent, the parent's code is already present (your worktree branched from theirs); build on it.
-## Run
-```bash
-kaizen run \
-  --system <system_id> \
-  --variant <variant_id> \
-  --parent <parent_or_omit> \
-  --hypothesis "<one-line hypothesis>" \
-  --state-dir "$KAIZEN_STATE_DIR"
-```
-The `kaizen run` runner is plain code, not an agent. It handles everything: writes manifest, tails NDJSON events into `events.jsonl`, atomically updates `state.json`, dumps worst items to `failures.jsonl`, decides promotion via paired-bootstrap statistical test, prints the score.
-The eval script may also write Langfuse dataset-run links and trace scores. Treat those writes as diagnostic persistence only. The authoritative run result for this loop is still the `complete.score` that `kaizen run` records from the NDJSON stream.
-You do **not** write `.kaizen/` files. You do **not** call any status CLI. The single `kaizen run` invocation is the whole interaction.
-## After the run
-- `kaizen run` prints `score=<n> run_id=<id> status=<complete|crashed|aborted> promoted=<bool>` and exits with code 0 (complete) or non-zero (crashed/aborted).
-- If complete, commit your code changes (`git add -A && git commit -m "run: <variant_id>"`) so child variants can branch from your branch.
-- If crashed, the runner has already recorded the crash event with stderr. Read `.kaizen/runs/<system>/<run_id>/state.json` for diagnostics.
-Kill any servers you started. The orchestrator will clean up the worktree.
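
The deleted template quotes the summary line that `kaizen run` prints (`score=<n> run_id=<id> status=<complete|crashed|aborted> promoted=<bool>`). As a minimal sketch of how a caller might read that line, assuming space-separated `key=value` pairs and `true`/`false` for the boolean: `parse_summary` and `RunSummary` are hypothetical names, not part of the package, and the exact text printed for a crashed run's score is not shown in the diff, so a non-numeric score is mapped to `None`.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed layout: the four key=value pairs appear in this order, separated by
# whitespace, as in the summary format quoted in the deleted template.
SUMMARY_RE = re.compile(
    r"score=(?P<score>\S+)\s+run_id=(?P<run_id>\S+)\s+"
    r"status=(?P<status>complete|crashed|aborted)\s+"
    r"promoted=(?P<promoted>true|false)"
)


@dataclass
class RunSummary:  # hypothetical container, not a Kaizen type
    score: Optional[float]
    run_id: str
    status: str
    promoted: bool


def parse_summary(line: str) -> Optional[RunSummary]:
    """Return the parsed summary, or None if the line does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    try:
        score: Optional[float] = float(match["score"])
    except ValueError:
        # Assumption: a run without a score prints something non-numeric.
        score = None
    return RunSummary(
        score=score,
        run_id=match["run_id"],
        status=match["status"],
        promoted=match["promoted"] == "true",
    )
```

The template also states that the exit code is 0 for `complete` and non-zero otherwise, so a caller would likely check the exit code first and treat the parsed line as supplementary detail.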

package/templates/workspace/.claude/commands/kaizen.md DELETED

@@ -1,65 +0,0 @@
----
-description: Drive a run loop over a system using Kaizen
----
-You are Kaizen, an automated AI researcher. Improve AI systems through iterative runs recorded under `.kaizen/runs/`.
-## Critical rules
-- **NEVER commit PHI** to this repo. PHI lives in Langfuse and production databases only.
-- **NEVER hardcode credentials.** Read keys from environment variables.
-## System selection
-If the user passed a system name (`/kaizen <name>`), read `systems/<name>.md`. Otherwise glob `systems/*.md` and ask which one.
-The system definition's frontmatter is the contract:
-- `run_eval` — path to the eval script
-- `eval_version`, `dataset_version` — bump = old scores incomparable
-- `primary_metric`, `target` — what we are optimizing
-- `execution_mode` — `in_process` or `server`
-## State model
-Truth lives on disk under `.kaizen/runs/<system>/<run_id>/`:
-- `manifest.json` — immutable record (variant, parent, hypothesis, git_sha, versions)
-- `events.jsonl` — append-only event stream from the eval (the source of truth for that run)
-- `state.json` — derived cache (status, score, progress) — runner-owned, never edit by hand
-- `failures.jsonl` — worst-K item events, for failure analysis
-The hypothesis log is `.kaizen/hypotheses/<system>.jsonl` — every run leaves a line, success or failure.
-The promoted baseline is derived on demand: the latest run that beat the previous promoted baseline with statistical confidence under matching `eval_version` / `dataset_version`. The leaderboard still shows the best raw score. `kaizen log --system <s>` shows the promoted baseline in the header alongside recent runs.
-## Eval script contract
-The system's `run_eval` script is the single eval entrypoint. `kaizen run` invokes it as:
-```bash
-<run_eval> --variant <variant_id> --dataset <dataset_version> --out-fd 3 [--max-items <n>]
-```
-The required contract is the NDJSON stream on `--out-fd`: emit `start`, one `item` per dataset item, and one terminal `complete` with a numeric `score` in `[0,1]`. The `complete.score` is what Kaizen records, compares, and returns to you in the CLI summary.
-For real customer systems, the eval script should also persist results back to Langfuse as a best-effort side effect:
-- Load the versioned Langfuse dataset named by `--dataset`.
-- Run the candidate system for each dataset item, producing a fresh Langfuse trace for that item.
-- Link each dataset item to that fresh trace in a Langfuse dataset run named for the Kaizen experiment or variant.
-- Write the primary metric as a Langfuse score on the fresh trace, with useful secondary metrics in score metadata or dataset-run metadata.
-- Include the fresh trace id in each Kaizen `item.trace_id` and in `complete.worst_traces`.
-Langfuse writes make traces, dataset runs, and scores durable for inspection. They must not replace the NDJSON stream; `.kaizen/runs/` remains Kaizen's local source of truth for experiment state and promotion.
-## Loop
-1. **Read state** — `kaizen log --system <s> -n 10` gives both the promoted baseline (header) and recent runs (body) in one call.
-2. **Propose variants** — for each, write a one-line hypothesis and a 2-3 sentence description.
-3. **Record runs** — spawn `variant-builder` subagents, each in its own git worktree. Pass each one the main checkout's absolute `.kaizen` path and require `KAIZEN_STATE_DIR=<main>/.kaizen` (or `--state-dir`) on `kaizen run`.
-4. **Read results** — every run prints `score=<n> run_id=<id> status=<...> promoted=<bool>` on stdout. The runner handles all state writes; do not write `.kaizen/` files yourself.
-5. **Promotion is automatic** — if the variant beats the promoted baseline with statistical significance (95% CI lower bound on the per-item delta > 0, no subgroup regressions), `kaizen run` promotes it. Promotion means "new baseline for future runs," not "open a PR." You will see `promoted=true` in the output.
-6. **Iterate** — repeat. The next call to `kaizen log` will show the new baseline.
-## What you do vs what the runner does
-- You write code, generate variants, analyze failures, decide what to try next, and prepare a PR from the latest promoted baseline when a human asks.
-- The `kaizen run` runner (a plain Node program, not an agent) writes all run state atomically and decides promotion via paired-bootstrap statistical test. If you find yourself editing `manifest.json`, `events.jsonl`, `state.json`, `failures.jsonl`, or the hypothesis log by hand, stop — that is a bug.
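
The "Eval script contract" section of the deleted template describes an NDJSON stream on `--out-fd`: a `start` event, one `item` per dataset item, and one terminal `complete` with a numeric `score` in `[0,1]`. The following is a rough sketch of an eval script that satisfies that shape, not the package's own implementation. The flag names come from the template. The `type` key, the per-item `id` and `score` fields, the in-memory dataset, and the exact-match scorer are all assumptions made for illustration; the diff does not show the actual event schema.

```python
import argparse
import json
import os
from typing import Callable, Iterable, Optional, Sequence, TextIO


def emit(out: TextIO, event: dict) -> None:
    """Write one NDJSON event (one JSON object per line) and flush it."""
    out.write(json.dumps(event) + "\n")
    out.flush()


def run_eval(
    items: Iterable[dict],
    score_item: Callable[[dict], float],
    out: TextIO,
    variant: str,
    dataset: str,
) -> float:
    """Emit start, one item event per dataset item, then a terminal complete."""
    # "type" as the event discriminator is an assumed field name.
    emit(out, {"type": "start", "variant": variant, "dataset": dataset})
    scores = []
    for item in items:
        item_score = float(score_item(item))
        scores.append(item_score)
        emit(out, {"type": "item", "id": item["id"], "score": item_score})
    mean = sum(scores) / len(scores) if scores else 0.0
    final = min(1.0, max(0.0, mean))  # keep the reported score inside [0, 1]
    emit(out, {"type": "complete", "score": final})
    return final


def exact_match(item: dict) -> float:
    """Placeholder scorer: 1.0 when the output equals the expected value."""
    return 1.0 if item["actual"] == item["expected"] else 0.0


def main(argv: Optional[Sequence[str]] = None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--variant", required=True)
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--out-fd", type=int, required=True)
    parser.add_argument("--max-items", type=int, default=None)
    args = parser.parse_args(argv)

    # Hypothetical stand-in for loading the dataset named by --dataset.
    items = [
        {"id": "a", "expected": "x", "actual": "x"},
        {"id": "b", "expected": "y", "actual": "z"},
        {"id": "c", "expected": "w", "actual": "w"},
    ]
    if args.max_items is not None:
        items = items[: args.max_items]

    # Events go to the file descriptor passed by the runner, not to stdout.
    with os.fdopen(args.out_fd, "w") as out:
        run_eval(items, exact_match, out, args.variant, args.dataset)
```

A real script would call `main()` as its entry point. Writing events to a dedicated descriptor rather than stdout keeps ordinary log output from corrupting the stream, which is presumably why the contract passes `--out-fd 3`.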

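Step 5 of the deleted loop description says promotion requires the 95% CI lower bound on the per-item delta to be above zero, decided by a paired-bootstrap test. The sketch below illustrates that general technique under stated assumptions; it is not the runner's code. It assumes a two-sided percentile interval (so the lower bound is the 2.5th percentile of resampled mean deltas), a fixed resample count, and per-item scores aligned by index. The subgroup-regression check the template mentions is omitted because the diff gives no detail on it. All function names are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap_lower_bound(
    baseline: Sequence[float],
    candidate: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> float:
    """Lower bound of a percentile bootstrap CI on the mean per-item delta."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    # Pairing: each delta compares the same dataset item under both variants.
    deltas = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    index = int((alpha / 2.0) * n_resamples)  # e.g. the 2.5th percentile
    return means[index]


def should_promote(baseline: Sequence[float], candidate: Sequence[float]) -> bool:
    """Promote only if the CI lower bound on the delta is strictly above zero."""
    return paired_bootstrap_lower_bound(baseline, candidate) > 0.0
```

Resampling the paired deltas rather than the two score lists independently removes per-item difficulty from the comparison, which is what lets a small but consistent improvement clear the threshold.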
/package/dashboard/.next/standalone/packages/kaizen/dashboard/.next/static/{YpQ-I4VL-aEdQrM5uN7_3 → SCF0o7YxElB9rzWaOohsA}/_ssgManifest.js RENAMED

File without changes