prizmkit 1.0.13 → 1.0.14

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (77)
  1. package/bin/create-prizmkit.js +4 -1
  2. package/bundled/VERSION.json +3 -3
  3. package/bundled/adapters/claude/command-adapter.js +35 -4
  4. package/bundled/adapters/claude/rules-adapter.js +6 -58
  5. package/bundled/adapters/claude/team-adapter.js +2 -2
  6. package/bundled/adapters/codebuddy/agent-adapter.js +0 -1
  7. package/bundled/adapters/codebuddy/rules-adapter.js +30 -0
  8. package/bundled/adapters/shared/frontmatter.js +3 -1
  9. package/bundled/dev-pipeline/README.md +13 -3
  10. package/bundled/dev-pipeline/launch-bugfix-daemon.sh +10 -0
  11. package/bundled/dev-pipeline/launch-daemon.sh +18 -4
  12. package/bundled/dev-pipeline/lib/common.sh +105 -0
  13. package/bundled/dev-pipeline/run-bugfix.sh +57 -57
  14. package/bundled/dev-pipeline/run.sh +75 -59
  15. package/bundled/dev-pipeline/scripts/check-session-status.py +47 -2
  16. package/bundled/dev-pipeline/scripts/cleanup-logs.py +192 -0
  17. package/bundled/dev-pipeline/scripts/detect-stuck.py +15 -3
  18. package/bundled/dev-pipeline/scripts/generate-bootstrap-prompt.py +32 -27
  19. package/bundled/dev-pipeline/scripts/generate-bugfix-prompt.py +23 -23
  20. package/bundled/dev-pipeline/scripts/update-feature-status.py +50 -2
  21. package/bundled/dev-pipeline/scripts/utils.py +22 -0
  22. package/bundled/dev-pipeline/templates/bootstrap-tier1.md +18 -1
  23. package/bundled/dev-pipeline/templates/bootstrap-tier2.md +19 -1
  24. package/bundled/dev-pipeline/templates/bootstrap-tier3.md +18 -2
  25. package/bundled/dev-pipeline/templates/session-status-schema.json +7 -1
  26. package/bundled/dev-pipeline/tests/__init__.py +0 -0
  27. package/bundled/dev-pipeline/tests/conftest.py +133 -0
  28. package/bundled/dev-pipeline/tests/test_check_session.py +127 -0
  29. package/bundled/dev-pipeline/tests/test_cleanup_logs.py +119 -0
  30. package/bundled/dev-pipeline/tests/test_detect_stuck.py +207 -0
  31. package/bundled/dev-pipeline/tests/test_generate_bugfix_prompt.py +181 -0
  32. package/bundled/dev-pipeline/tests/test_generate_prompt.py +190 -0
  33. package/bundled/dev-pipeline/tests/test_init_bugfix_pipeline.py +153 -0
  34. package/bundled/dev-pipeline/tests/test_init_pipeline.py +241 -0
  35. package/bundled/dev-pipeline/tests/test_update_bug_status.py +142 -0
  36. package/bundled/dev-pipeline/tests/test_update_feature_status.py +277 -0
  37. package/bundled/dev-pipeline/tests/test_utils.py +141 -0
  38. package/bundled/rules/USAGE.md +153 -0
  39. package/bundled/rules/_rules-metadata.json +43 -0
  40. package/bundled/rules/general/prefer-linux-commands.md +9 -0
  41. package/bundled/rules/prizm/prizm-commit-workflow.md +10 -0
  42. package/bundled/rules/prizm/prizm-documentation.md +19 -0
  43. package/bundled/rules/prizm/prizm-progressive-loading.md +11 -0
  44. package/bundled/skills/_metadata.json +130 -67
  45. package/bundled/skills/app-planner/SKILL.md +252 -499
  46. package/bundled/skills/app-planner/assets/evaluation-guide.md +44 -0
  47. package/bundled/skills/app-planner/scripts/validate-and-generate.py +143 -4
  48. package/bundled/skills/bug-planner/SKILL.md +58 -13
  49. package/bundled/skills/bugfix-pipeline-launcher/SKILL.md +5 -7
  50. package/bundled/skills/dev-pipeline-launcher/SKILL.md +16 -7
  51. package/bundled/skills/feature-workflow/SKILL.md +175 -234
  52. package/bundled/skills/prizm-kit/SKILL.md +17 -31
  53. package/bundled/skills/{prizmkit-adr-manager → prizmkit-tool-adr-manager}/SKILL.md +6 -7
  54. package/bundled/skills/{prizmkit-api-doc-generator → prizmkit-tool-api-doc-generator}/SKILL.md +4 -5
  55. package/bundled/skills/{prizmkit-bug-reproducer → prizmkit-tool-bug-reproducer}/SKILL.md +4 -5
  56. package/bundled/skills/{prizmkit-ci-cd-generator → prizmkit-tool-ci-cd-generator}/SKILL.md +4 -5
  57. package/bundled/skills/{prizmkit-db-migration → prizmkit-tool-db-migration}/SKILL.md +4 -5
  58. package/bundled/skills/{prizmkit-dependency-health → prizmkit-tool-dependency-health}/SKILL.md +3 -4
  59. package/bundled/skills/{prizmkit-deployment-strategy → prizmkit-tool-deployment-strategy}/SKILL.md +4 -5
  60. package/bundled/skills/{prizmkit-error-triage → prizmkit-tool-error-triage}/SKILL.md +4 -5
  61. package/bundled/skills/{prizmkit-log-analyzer → prizmkit-tool-log-analyzer}/SKILL.md +4 -5
  62. package/bundled/skills/{prizmkit-monitoring-setup → prizmkit-tool-monitoring-setup}/SKILL.md +4 -5
  63. package/bundled/skills/{prizmkit-onboarding-generator → prizmkit-tool-onboarding-generator}/SKILL.md +4 -5
  64. package/bundled/skills/{prizmkit-perf-profiler → prizmkit-tool-perf-profiler}/SKILL.md +4 -5
  65. package/bundled/skills/{prizmkit-security-audit → prizmkit-tool-security-audit}/SKILL.md +3 -4
  66. package/bundled/skills/{prizmkit-tech-debt-tracker → prizmkit-tool-tech-debt-tracker}/SKILL.md +3 -4
  67. package/bundled/skills/refactor-skill/SKILL.md +371 -0
  68. package/bundled/skills/refactor-workflow/SKILL.md +17 -119
  69. package/package.json +1 -1
  70. package/src/external-skills.js +71 -0
  71. package/src/index.js +62 -4
  72. package/src/metadata.js +36 -0
  73. package/src/scaffold.js +136 -32
  74. package/bundled/skills/prizmkit-bug-fix-workflow/SKILL.md +0 -356
  75. package/bundled/templates/claude-md-template.md +0 -38
  76. package/bundled/templates/codebuddy-md-template.md +0 -35
  77. package/bundled/skills/{prizmkit-adr-manager → prizmkit-tool-adr-manager}/assets/adr-template.md +0 -0
@@ -0,0 +1,44 @@
+ # App Planner Evaluation Guide
+
+ This guide is for maintainers who evaluate and iterate on the `app-planner` skill quality.
+
+ ## Evaluation & Quality Gates (Optional but Recommended)
+
+ After multiple planning cycles or before committing refined skill logic, run standardized evaluation.
+
+ ### One-Command Evaluation
+
+ Requires npm setup:
+
+ ```bash
+ npm run skill:review -- \
+ --workspace /.codebuddy/skill-evals/app-planner-workspace \
+ --iteration iteration-N \
+ --skill-name app-planner \
+ --skill-path /core/skills/app-planner \
+ --runs 3 \
+ --grader-cmd "python3 /core/skills/app-planner/scripts/validate-and-generate.py grade --workspace {workspace} --iteration {iteration}"
+ ```
+
+ Produces:
+ - `benchmark.json` — quantitative metrics (pass rate, feature quality, time)
+ - `benchmark.md` — human-readable summary
+ - `review.html` — interactive evaluation viewer
+
+ ### Metrics Tracked
+
+ | Metric | Computation | Target | Interpretation |
+ |--------|-------------|--------|-----------------|
+ | `plan_validity` | % runs with validation pass | >95% | Higher = more robust planning |
+ | `avg_features_per_run` | avg feature count | ±20% consistency | Should be stable across runs |
+ | `avg_acceptance_criteria` | AC count per feature | 4-6 | Target sweet spot for test coverage |
+ | `dependency_complexity` | max DAG depth, cycle count | depth < 5 | Manageable dependency graph |
+ | `description_quality` | word count, keyword coverage | min 20 words | Sufficient AC detail |
+ | `latency_sec` | wall-clock execution time | <120s per run | UX acceptable |
+
+ ### When to Run Evaluation
+
+ - After major SKILL.md revisions
+ - Before releasing new skill updates
+ - Quarterly quality assurance
+ - Post-optimization to measure improvement
@@ -7,11 +7,13 @@ Commands:
  validate Validate an existing feature-list.json
  template Generate a blank template feature-list.json
  summary Print a summary table of features from a feature-list.json
+ grade Generate grading results from eval runs (for npm run skill:review)

  Usage:
- python3 validate-and-generate.py validate --input feature-list.json [--output validated.json]
+ python3 validate-and-generate.py validate --input feature-list.json [--output validated.json] [--mode new|incremental]
  python3 validate-and-generate.py template --output feature-list.json
  python3 validate-and-generate.py summary --input feature-list.json [--format markdown|json]
+ python3 validate-and-generate.py grade --workspace /.codebuddy/skill-evals/app-planner-workspace --iteration iteration-1

  Python 3.6+ required. No external dependencies.
  """
@@ -33,6 +35,7 @@ SCHEMA_VERSION = "dev-pipeline-feature-list-v1"
  VALID_STATUSES = {"pending", "in_progress", "completed", "failed", "skipped", "split"}
  VALID_COMPLEXITIES = {"low", "medium", "high"}
  VALID_GRANULARITIES = {"feature", "sub_feature", "auto"}
+ VALID_PLANNING_MODES = {"new", "incremental"}

  FEATURE_ID_RE = re.compile(r"^F-\d{3}(-[A-Z])?$")
  SUB_FEATURE_ID_RE = re.compile(r"^F-\d{3}-[A-Z]$")
@@ -89,6 +92,7 @@ def _write_json(path, data):
  # ---------------------------------------------------------------------------


+
  def _detect_cycles(features):
  """Return (has_cycles: bool, max_depth: int) using Kahn's topological sort.

@@ -141,11 +145,14 @@ def _detect_cycles(features):
  # ---------------------------------------------------------------------------


- def validate_feature_list(data):
+ def validate_feature_list(data, planning_mode="new"):
  """Validate a parsed feature-list data structure.

  Returns a dict with keys ``valid``, ``errors``, ``warnings``, ``stats``.
  """
+ if planning_mode not in VALID_PLANNING_MODES:
+ planning_mode = "new"
+
  errors = []
  warnings = []

@@ -258,7 +265,7 @@ def validate_feature_list(data):
  label, status, ", ".join(sorted(VALID_STATUSES))
  )
  )
- if status and status != "pending":
+ if planning_mode == "new" and status and status != "pending":
  warnings.append(
  "{}: status is '{}' (expected 'pending' for new plans)".format(label, status)
  )
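For readers of this diff: the new `planning_mode` gate above only changes when status warnings fire. A minimal standalone re-statement of that condition (the `warns_on_status` helper is illustrative, not part of the package):

```python
def warns_on_status(planning_mode: str, status: str) -> bool:
    """Mirror of the guard in validate_feature_list: only 'new' plans
    warn on non-pending statuses; 'incremental' plans may carry any status."""
    if planning_mode not in {"new", "incremental"}:
        planning_mode = "new"  # same fallback as the packaged code
    return planning_mode == "new" and bool(status) and status != "pending"
```

This is what lets `validate --mode incremental` re-check a partially completed feature list without spurious warnings.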
@@ -621,7 +628,7 @@ def cmd_validate(args):
  print(json.dumps(result, indent=2, ensure_ascii=False))
  return 2

- result = validate_feature_list(data)
+ result = validate_feature_list(data, planning_mode=args.mode)

  # Print results to stdout
  print(json.dumps(result, indent=2, ensure_ascii=False))
@@ -682,6 +689,114 @@ def cmd_summary(args):
  return 0


+ def cmd_grade(args):
+ """Handle the 'grade' command for evaluation framework integration.
+
+ Collects validation results from eval runs and generates grading data.
+ Used by npm run skill:review for automated evaluation of app-planner.
+ """
+ workspace = getattr(args, 'workspace', None)
+ iteration = getattr(args, 'iteration', None)
+
+ if not workspace or not iteration:
+ _err("--workspace and --iteration are required for the grade command")
+ return 2
+
+ workspace_path = os.path.expanduser(workspace)
+
+ if not os.path.isdir(workspace_path):
+ _err("Workspace directory not found: {}".format(workspace_path))
+ return 2
+
+ # Collect run outputs from iteration subdirectory
+ iteration_dir = os.path.join(workspace_path, iteration)
+ if not os.path.isdir(iteration_dir):
+ _err("Iteration directory not found: {}".format(iteration_dir))
+ return 2
+
+ # Find all eval-* subdirectories
+ eval_dirs = []
+ try:
+ for item in os.listdir(iteration_dir):
+ item_path = os.path.join(iteration_dir, item)
+ if os.path.isdir(item_path) and item.startswith('eval-'):
+ eval_dirs.append((item, item_path))
+ except Exception as exc:
+ _err("Failed to list iteration directory: {}".format(exc))
+ return 2
+
+ if not eval_dirs:
+ _warn("No eval-* directories found in {}".format(iteration_dir))
+
+ grades = []
+
+ for eval_name, eval_path in sorted(eval_dirs):
+ output_json = os.path.join(eval_path, "outputs", "feature-list.json")
+
+ if not os.path.isfile(output_json):
+ _info("Skipping {}: no outputs/feature-list.json found".format(eval_name))
+ continue
+
+ # Load and validate the output
+ data, load_err = _load_json(output_json)
+ if load_err:
+ _warn("Failed to load {}: {}".format(output_json, load_err))
+ continue
+
+ # Run validation
+ result = validate_feature_list(data, planning_mode="new")
+
+ # Create grading entry
+ grade_entry = {
+ "test_name": eval_name,
+ "passed": result["valid"],
+ "assertions": [
+ {
+ "name": "Feature list valid schema",
+ "passed": result["valid"],
+ "evidence": "valid={}".format(result["valid"])
+ },
+ {
+ "name": "No cycles in DAG",
+ "passed": not result["stats"].get("has_cycles", False),
+ "evidence": "cycles={}".format(result["stats"].get("has_cycles", False))
+ },
+ {
+ "name": "Features generated",
+ "passed": result["stats"].get("total_features", 0) > 0,
+ "evidence": "count={}".format(result["stats"].get("total_features", 0))
+ },
+ {
+ "name": "No validation errors",
+ "passed": len(result.get("errors", [])) == 0,
+ "evidence": "error_count={}".format(len(result.get("errors", [])))
+ }
+ ]
+ }
+
+ grades.append(grade_entry)
+
+ # Write grading.json to eval run directory
+ grading_file = os.path.join(eval_path, "grading.json")
+ _write_json(grading_file, grade_entry)
+ _info("Wrote grading to {}".format(grading_file))
+
+ # Write aggregated results
+ aggregated = {
+ "iteration": iteration,
+ "total_runs": len(grades),
+ "passed_runs": sum(1 for g in grades if g["passed"]),
+ "pass_rate": len([g for g in grades if g["passed"]]) / len(grades) if grades else 0,
+ "grades": grades
+ }
+
+ benchmark_file = os.path.join(iteration_dir, "benchmark.json")
+ _write_json(benchmark_file, aggregated)
+ _info("Wrote aggregated benchmark to {}".format(benchmark_file))
+
+ return 0
+
+
  def main():
  parser = argparse.ArgumentParser(
  description="Validate and generate feature-list.json files for the dev-pipeline system.",
@@ -689,6 +804,7 @@ def main():
  epilog=(
  "Examples:\n"
  " %(prog)s validate --input feature-list.json\n"
+ " %(prog)s validate --input feature-list.json --mode incremental\n"
  " %(prog)s validate --input feature-list.json --output validated.json\n"
  " %(prog)s template --output feature-list.json\n"
  " %(prog)s summary --input feature-list.json\n"
@@ -709,6 +825,12 @@ def main():
709
825
  p_validate.add_argument(
710
826
  "--output", help="Path to write validated output (optional)"
711
827
  )
828
+ p_validate.add_argument(
829
+ "--mode",
830
+ choices=["new", "incremental"],
831
+ default="new",
832
+ help="Validation mode (default: new)",
833
+ )
712
834
 
713
835
  # -- template --
714
836
  p_template = subparsers.add_parser(
@@ -734,6 +856,22 @@ def main():
734
856
  help="Output format (default: markdown)",
735
857
  )
736
858
 
859
+ # -- grade --
860
+ p_grade = subparsers.add_parser(
861
+ "grade",
862
+ help="Generate grading results from eval runs (for npm run skill:review)",
863
+ )
864
+ p_grade.add_argument(
865
+ "--workspace",
866
+ required=True,
867
+ help="Path to eval workspace (e.g., /.codebuddy/skill-evals/app-planner-workspace)",
868
+ )
869
+ p_grade.add_argument(
870
+ "--iteration",
871
+ required=True,
872
+ help="Iteration ID (e.g., iteration-1)",
873
+ )
874
+
737
875
  args = parser.parse_args()
738
876
 
739
877
  if not args.command:
@@ -744,6 +882,7 @@ def main():
744
882
  "validate": cmd_validate,
745
883
  "template": cmd_template,
746
884
  "summary": cmd_summary,
885
+ "grade": cmd_grade,
747
886
  }
748
887
 
749
888
  handler = dispatch.get(args.command)
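Taken together, the `grade` additions in this file boil down to one aggregation step: fold per-run grade entries into the pass-rate summary written to `benchmark.json`. A minimal sketch of that fold, using the same field names `cmd_grade` writes (the `aggregate` helper itself is illustrative, not part of the package):

```python
def aggregate(iteration, grades):
    """Build the benchmark.json payload from per-run grade entries.

    Each entry only needs a boolean "passed" key, as produced by cmd_grade.
    """
    passed = sum(1 for g in grades if g["passed"])
    return {
        "iteration": iteration,
        "total_runs": len(grades),
        "passed_runs": passed,
        "pass_rate": passed / len(grades) if grades else 0,  # guard empty runs
        "grades": grades,
    }
```

Note that an empty iteration yields `pass_rate` 0 rather than a division error, matching the conditional in the packaged code.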
@@ -57,6 +57,22 @@ Output: `project_name`, `project_description`, `global_context` fields populated

  Accept bug information in ANY of these formats (auto-detect):

+ #### Severity Auto-Classification Rules
+
+ When extracting bugs, apply these rules to auto-suggest severity:
+
+ | Severity | Indicators | Examples |
+ |----------|------------|----------|
+ | **critical** | System crash, data loss, security breach, OOM, unrecoverable error | `Segmentation fault`, `OutOfMemoryError`, `SQL injection vulnerability`, `Database corrupted` |
+ | **high** | Core feature broken, authentication failure, data integrity issue, timeout | `Auth token invalid`, `Payment failed`, `Connection timeout`, `500 Internal Server Error` |
+ | **medium** | Feature partially broken, workaround exists, incorrect output | `CSV encoding issue`, `Pagination not working`, `Wrong date format`, `Missing validation` |
+ | **low** | Cosmetic issue, minor inconvenience, edge case | `UI misalignment`, `Typo in error message`, `Slow loading (non-critical page)`, `Non-breaking warning` |
+
+ **Special cases:**
+ - Failed test → medium (unless test covers critical path, then high)
+ - User report with "cannot use app" → high
+ - User report with "annoying but works" → low
+
  #### Format A: Stack Trace / Error Log
  ```
  TypeError: Cannot read property 'token' of null
@@ -124,18 +140,47 @@ ALERT: Error rate spike: 500 errors/min on /api/login endpoint
  ### Phase 4: Generate & Validate

  1. **Generate `bug-fix-list.json`**: Conform to `dev-pipeline/templates/bug-fix-list-schema.json`
- 2. **Validate against schema**: Auto-run validation
+ 2. **Validate against schema**: Run the validation checks below
  3. **Write file** to project root (or user-specified path)
- 4. **Output**: File path, summary, and next steps:
- ```
- bug-fix-list.json generated with 3 bugs (1 critical, 1 medium, 1 low)
-
- Next steps:
- - Review: cat bug-fix-list.json
- - Start fixing: say "开始修复" or "start fixing bugs" to launch the bugfix pipeline
- - Or run directly: ./dev-pipeline/launch-bugfix-daemon.sh start bug-fix-list.json
- - Fix one interactively: invoke prizmkit-bug-fix-workflow for each bug
- ```
+ 4. **Output**: File path, summary, and next steps
+
+ #### Schema Validation Checklist
+
+ Before writing the file, verify all items pass:
+
+ **Required fields:**
+ - [ ] `$schema`: must be `"dev-pipeline-bug-fix-list-v1"`
+ - [ ] `project_name`: non-empty string
+ - [ ] `bugs`: non-empty array
+
+ **Per-bug required fields:**
+ - [ ] `id`: matches pattern `B-NNN` (e.g., `B-001`)
+ - [ ] `title`: non-empty string
+ - [ ] `description`: non-empty string
+ - [ ] `severity`: one of `critical`, `high`, `medium`, `low`
+ - [ ] `error_source.type`: one of `stack_trace`, `user_report`, `failed_test`, `log_pattern`, `monitoring_alert`
+ - [ ] `verification_type`: one of `automated`, `manual`, `hybrid`
+ - [ ] `acceptance_criteria`: non-empty array of strings
+ - [ ] `status`: must be `pending` for new bugs
+
+ **Consistency checks:**
+ - [ ] No duplicate bug IDs
+ - [ ] No duplicate priorities (each bug should have unique priority number)
+ - [ ] If `affected_feature` is set, verify it exists in `feature-list.json` (if available)
+
+ If any check fails, fix before writing the file.
+
+ #### Success Output
+
+ ```
+ ✅ bug-fix-list.json generated with 3 bugs (1 critical, 1 medium, 1 low)
+
+ Next steps:
+ - Review: cat bug-fix-list.json
+ - Start fixing: say "开始修复" or "start fixing bugs" to launch the bugfix pipeline
+ - Or run directly: ./dev-pipeline/launch-bugfix-daemon.sh start bug-fix-list.json
+ - Fix one interactively: invoke bug-fix-workflow for each bug
+ ```

  ---

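For readers of this diff: the per-bug checks in the new checklist above are mechanical enough to script. A hedged sketch of the core ones (the `check_bug` helper and its return shape are illustrative, not shipped in the package):

```python
import re

# Patterns and vocabularies taken from the checklist in the diff above
BUG_ID_RE = re.compile(r"^B-\d{3}$")
SEVERITIES = {"critical", "high", "medium", "low"}

def check_bug(bug, seen_ids):
    """Return a list of problems for one bug entry from bug-fix-list.json."""
    problems = []
    bug_id = bug.get("id", "")
    if not BUG_ID_RE.match(bug_id):
        problems.append("id must match B-NNN")
    if bug_id in seen_ids:
        problems.append("duplicate bug id")
    seen_ids.add(bug_id)
    if bug.get("severity") not in SEVERITIES:
        problems.append("invalid severity")
    if bug.get("status") != "pending":
        problems.append("status must be 'pending' for new bugs")
    return problems
```

Passing a shared `seen_ids` set across all bugs is what implements the "no duplicate bug IDs" consistency check.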
@@ -150,7 +195,7 @@ Batch-parse error logs to generate bug entries without interactive prompts:
  3. Auto-generate bug entries with:
  - Title: first line of error message
  - Description: full error context
- - Severity: auto-classify (crash/OOM=critical, auth/timeout=high, validation=medium, other=low)
+ - Severity: use the **Severity Auto-Classification Rules** (see Phase 2)
  - error_source: populated from log content
  - verification_type: default to `automated`
  - acceptance_criteria: auto-generate "Error no longer occurs in [scenario]"
@@ -210,7 +255,7 @@ After `bug-fix-list.json` is generated, the user can:
  1. **Say "开始修复" or "start fixing bugs"** — triggers `bugfix-pipeline-launcher` skill to auto-launch pipeline in background (recommended)
  2. **Background daemon**: `./dev-pipeline/launch-bugfix-daemon.sh start bug-fix-list.json`
  3. **Foreground run**: `./dev-pipeline/run-bugfix.sh run bug-fix-list.json`
- 4. **Fix single bug interactively**: invoke `prizmkit-bug-fix-workflow` in current session
+ 4. **Fix single bug interactively**: invoke `bug-fix-workflow` in current session
  5. **Retry a failed bug**: `./dev-pipeline/retry-bug.sh B-001`

  ## Error Handling
@@ -36,7 +36,7 @@ Launch the autonomous bug fix pipeline from within a cbc conversation. The pipel

  **Do NOT use this skill when:**
  - User wants to plan/collect bugs (use `bug-planner` instead)
- - User wants to fix a single bug interactively in current session (use `prizmkit-bug-fix-workflow`)
+ - User wants to fix a single bug interactively in current session (use `bug-fix-workflow`)
  - User wants to launch the feature pipeline (use `dev-pipeline-launcher`)

  ### Prerequisites
@@ -225,13 +225,11 @@ When user says "retry B-001" or "重试 B-001":
  dev-pipeline/retry-bug.sh B-001 bug-fix-list.json
  ```

- Or within the main pipeline (reset + resume):
+ **Note:** `retry-bug.sh` automatically cleans bug artifacts and resets status before retrying. This is equivalent to `reset-feature.sh --clean --run` in the feature pipeline. No separate reset command is needed.
+
+ Environment variables (optional):
  ```bash
- python3 dev-pipeline/scripts/update-bug-status.py \
- --bug-list bug-fix-list.json \
- --state-dir dev-pipeline/bugfix-state \
- --bug-id B-001 --action reset
- # Then restart pipeline to pick it up
+ SESSION_TIMEOUT=3600 dev-pipeline/retry-bug.sh B-001 bug-fix-list.json
  ```

  ### Error Handling
@@ -1,17 +1,17 @@
  ---
  name: "dev-pipeline-launcher"
- description: "Launch and manage the dev-pipeline from within a cbc session. Start pipeline in background, monitor logs, check status, stop pipeline. Invoke when user wants to start building features, run the pipeline, or check pipeline progress. (project)"
+ description: "Launch and manage the dev-pipeline from within an AI CLI session. Start pipeline in background, monitor logs, check status, stop pipeline. Invoke when user wants to start building features, run the pipeline, or check pipeline progress. (project)"
  ---

  # Dev-Pipeline Launcher

- Launch the autonomous development pipeline from within a cbc conversation. The pipeline runs as a fully detached background process -- closing the cbc session does NOT stop the pipeline.
+ Launch the autonomous development pipeline from within an AI CLI conversation. The pipeline runs as a fully detached background process -- closing the AI CLI session does NOT stop the pipeline.

  ### Mandatory Execution Mode (MUST)

  - Always use daemon mode via `dev-pipeline/launch-daemon.sh` for start/stop/status/log actions.
  - NEVER run `dev-pipeline/run.sh run ...` directly from this skill.
- - Reason: foreground `run.sh` can be terminated by AI CLI command timeout (e.g. cbc 120s), while daemon mode survives session timeout.
+ - Reason: foreground `run.sh` can be terminated by AI CLI command timeout (e.g. cbc 120s, claude may vary), while daemon mode survives session timeout.

  ### When to Use

@@ -49,11 +49,12 @@ Before any action, validate:

  1. **dev-pipeline exists**: Confirm `dev-pipeline/launch-daemon.sh` is present and executable
  2. **For start**: `feature-list.json` must exist in project root (or user-specified path)
- 3. **Dependencies**: `jq`, `python3`, `cbc` must be in PATH
+ 3. **Dependencies**: `jq`, `python3`, AI CLI (`cbc` or `claude`) must be in PATH
+ 4. **Python version**: Requires Python 3.8+ for dev-pipeline scripts

  Quick check:
  ```bash
- command -v jq && command -v python3 && command -v cbc && echo "All dependencies OK"
+ command -v jq && command -v python3 && (command -v cbc || command -v claude) && echo "All dependencies OK"
  ```

  If `feature-list.json` is missing, inform user:
@@ -227,8 +228,14 @@ When user says "从头重试 F-003" or "clean retry F-003":
  dev-pipeline/reset-feature.sh F-003 --clean --run feature-list.json
  ```

+ Environment variables (optional):
+ ```bash
+ SESSION_TIMEOUT=3600 dev-pipeline/retry-feature.sh F-003 feature-list.json
+ ```
+
  Notes:
  - `retry-feature.sh` runs exactly one feature session and exits.
+ - `reset-feature.sh --clean --run` clears the feature state before retrying (fresh start).
  - Keep pipeline daemon mode for main run management (`launch-daemon.sh`).

  ### Error Handling
@@ -237,17 +244,19 @@ Notes:
  |-------|--------|
  | `feature-list.json` not found | Tell user to run `app-planner` skill first |
  | `jq` not installed | Suggest: `brew install jq` |
- | `cbc` not in PATH | Check CodeBuddy CLI installation |
+ | `cbc`/`claude` not in PATH | Check AI CLI installation |
  | Pipeline already running | Show status, ask if user wants to stop and restart |
  | PID file stale (process dead) | `launch-daemon.sh` auto-cleans, retry start |
  | Launch failed (process died immediately) | Show last 20 lines of log: `tail -20 dev-pipeline/state/pipeline-daemon.log` |
+ | Feature stuck/blocked | Use `retry-feature.sh <F-XXX>` to retry; use `reset-feature.sh <F-XXX> --clean --run` for fresh start |
  | All features blocked/failed | Show status, suggest daemon-safe recovery: `dev-pipeline/reset-feature.sh <F-XXX> --clean --run feature-list.json` |
  | Permission denied on script | Run `chmod +x dev-pipeline/launch-daemon.sh dev-pipeline/run.sh` |

  ### Integration Notes

  - **After app-planner**: This is the natural next step. When user finishes planning and has `feature-list.json`, suggest launching the pipeline.
- - **Session independence**: The pipeline runs completely detached. User can close cbc, open a new session later, and use this skill to check progress or stop the pipeline.
+ - **Session independence**: The pipeline runs completely detached. User can close the AI CLI session, open a new session later, and use this skill to check progress or stop the pipeline.
  - **Single instance**: Only one pipeline can run at a time. The PID file prevents duplicates.
+ - **Pipeline coexistence**: Feature and bugfix pipelines use separate state directories (`state/` vs `bugfix-state/`), so they can run simultaneously without conflict.
  - **State preservation**: Stopping and restarting the pipeline resumes from where it left off -- completed features are not re-run.
  - **HANDOFF**: After pipeline completes all features, suggest running `prizmkit-code-review` for overall review, or `prizmkit-summarize` to archive.